EVALUATION


The  increasing  trend  towards  automated  radar  systems  and  "intelligent" 
signal  processing  requires  the  sensor  to  treat  the  environmental  scatter  as 
information  as  well  as  "clutter"  or  interference.  By  inference  from 
measurable  quantities  and  statistics,  the  processor  will  recognize  the 
existence  of  weather,  chaff,  discrete  targets,  statistically  defined 
"homogeneous"  areas,  shadowing  as  opposed  to  specular  reflection,  and 
other  environmental  categories.  This  information  will  allow  the  system 
to  adapt  its  waveform,  energy  budget,  detection/CFAR  and  tracking  algorithms 
for  optimum  performance.  Unfortunately,  while  some  clutter  parameters  can 
be  modeled  as  deterministic  or  as  simple  random  variables  with  excellent 
results, many observable characteristics appear to be nonstationary,
time-varying, or otherwise ill-defined. The development of "intelligent"
autonomous sensors requires an improved approach for analyzing and testing large
data  sets  in  support  of  modeling  these  unknown  quantities. 

This  post-doctoral  effort  presents  a  summary  of  pattern  recognition  and 
statistical  decision  theory  and  stresses  the  strengths,  weaknesses,  and 
peculiarities of parametric and nonparametric algorithms. The effort
provides valuable insight into the robustness and limitations of several
algorithms  and  emphasizes  the  care  required  in  using  these  techniques  for 
data  analysis.  This  effort  supports  the  Air  Force  requirements  as  defined 
in  TPO  4A. 



WILLIAM  L.  SIMKINS,  JR. 
Project  Engineer 
1.0 Introduction


Pattern recognition techniques have previously been applied to radar
problems, in particular to signal detection. The basic theories of
statistical hypothesis testing and decision theory apply.

This paper is a summary of the basic principles of pattern
recognition and statistical decision theory. The aim has been to
produce a brief exposition of the theory and terminology, with
sufficient rigor to allow an understanding of the fundamentals.
References have been selected for their lucidity and tied together
to aid understanding.

The  second  section  of  the  paper  deals  with  the  fundamentals  of 
statistical decision theory. It can be seen from this exposition that the
terminology applied to radar detection is quite similar, if not identical,
to pattern recognition terminology. Sections three and four then deal
with supervised and unsupervised learning, respectively.

The  fifth  section  contains  a  discussion  of  testing  methods,  and  the 
sixth  is  a  summary  of  test  results  on  simulated  radar  data.  The 
seventh section is a brief discussion of some results using actual
radar data, while section eight contains conclusions and
recommendations. A glossary and bibliography are appended, together with
information  about  the  computer  programs  used  to  implement  the 
various  algorithms  discussed  in  the  report. 
2.0 Fundamentals of Statistical Decision Theory


The  objectives  of  a  radar  system  are  to:  (a)  detect  the  presence 
of  objects  in  clutter  and  noise,  and  (b)  estimate  their  positions  and 
motions  in  space  relative  to  the  radar.  These  objectives  can  be 
studied  in  terms  of  the  discipline  known  as  'statistical  decision 
theory'.  Reference  (1)  has  an  excellent  discussion  of  statistical 
decision  theory  as  it  applies  to  radar  problems.  The  following 
treatment  is  taken  from  Chapter  8  of  reference  (1). 

2.1 Detection

A  radar  echo  is  generally  immersed  in  some  form  of  additive 
noise, and also usually in clutter return. Since noise and clutter are
random phenomena, a decision that is statistical in nature must be made
concerning the presence or absence of a target echo. We would
like  to  minimize  the  number  of  incorrect  decisions.  Consequently,  if 
we  have  a  priori  information  concerning  the  echo  signal,  noise,  and 
clutter  we  can  take  advantage  of  this  in  making  our  decisions. 

As a problem in hypothesis testing, the detection of a signal in
noise can be seen as making a decision with regard to a finite-duration
sample of a noisy waveform. This sample may or may not contain
a signal. Thus, the hypothesis that the received waveform does not
contain a signal is to be tested against the hypothesis that the
waveform does contain a signal. The first hypothesis, denoted by $H_0$, is
often called the 'null' hypothesis. The second hypothesis, denoted by
$H_1$, is referred to as the 'alternative' hypothesis. If the signal to be
detected is deterministic (i.e., its structure is completely known)
then $H_1$ is called a 'simple' alternative. In radar this situation
almost never occurs, since echo amplitude and phase are usually
unknown. When the signal to be detected is a member of a finite or
infinite set of signals, $H_1$ is a 'composite' alternative; if $H_1$ is
true, we can conclude only that one member of the signal class is
present, whose identity is not revealed by the test.

Let us represent the class of possible signals (echoes) as
vector points s in signal space $\Omega$. Each point in the space represents
a waveform with a particular combination of signal parameter values
such as amplitude, phase, doppler, etc. When possible, a probability
of occurrence is assigned to each combination of signal parameters.
This information is contained in a joint a priori probability density
function $\sigma(s)$ over all the points s in signal space $\Omega$.

In  a  similar  way  noise  and  clutter  spaces  can  be  defined  whose 
points  n  describe  all  possible  waveform  realizations  of  the  noise  and 
clutter  process  within  the  observation  interval.  From  the  statistical 
and spectral properties of the noise and clutter, an a priori joint
probability density p(n) can be deduced that describes the frequency of
occurrence  of  waveforms  in  this  space. 

Next, an observation space $\Gamma$ is defined whose points v
represent all possible joint combinations of signal and noise waveforms
within the observation interval. The frequency of occurrence of
members of this space can also be described by an a priori probability density
function, which is written as a conditional probability $p(v \mid s)$ to show the
explicit dependence of the observed waveform v on the signal s. For
convenience, we include the null hypothesis s = 0 as a point in signal
space $\Omega$.
An essential feature of the theory is the decision rule by which a
decision is made. This rule depends only on the observed waveform v
and not on the signal s. A decision rule leading to a decision d as a
result of the observation v is denoted by $D(d \mid v)$. $D(d \mid v)$ describes
the conditional probability of deciding d having observed v. Thus, for
a particular waveform v there is only a probability that a decision
$d_1$ = "yes" or $d_0$ = "no" will be made. Such a decision rule is called a
'randomized decision rule' and its implementation requires a chance
mechanism as part of the receiver structure. In practical applications,
the decision rule usually reduces to a 'nonrandom decision rule'
where a probability of 0 or 1 is assigned to $D(d_0 \mid v)$ and $D(d_1 \mid v)$ for
each observation v. In this case the receiver does not require a chance
mechanism.
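
The distinction between the two kinds of rule can be made concrete with a
minimal sketch. It assumes a scalar observation v and two decisions; the
threshold, the ramp limits, and the function names are hypothetical choices
made for illustration and do not come from the report.

```python
import random

# Hypothetical example: thresholds and names are illustrative assumptions.

def nonrandom_rule(v, threshold=1.0):
    """Return D(d0|v) and D(d1|v) restricted to the values 0 or 1."""
    p_d1 = 1.0 if v >= threshold else 0.0
    return {"d0": 1.0 - p_d1, "d1": p_d1}

def randomized_rule(v, low=0.5, high=1.5):
    """Return D(d0|v) and D(d1|v) with a linear ramp between low and high."""
    p_d1 = min(1.0, max(0.0, (v - low) / (high - low)))
    return {"d0": 1.0 - p_d1, "d1": p_d1}

def decide(rule, v, rng):
    """Draw a decision; the chance mechanism only matters for randomized rules."""
    return "d1" if rng.random() < rule(v)["d1"] else "d0"

rng = random.Random(0)
print(nonrandom_rule(1.2))       # probabilities are 0 or 1
print(randomized_rule(1.2))      # probabilities strictly between 0 and 1
print(decide(nonrandom_rule, 1.2, rng))
```

For the nonrandom rule the call to `decide` always returns the same decision
for a given v, which is why the receiver needs no chance mechanism in that
case.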

The set of possible decisions d in a statistical decision problem
can be described as points in a decision space $\Delta$. If the interpretation
of a decision rule $D(d \mid v)$ as a probability (or probability density if a
continuum of possible decisions is considered) is retained, then
$D(d \mid v)$ describes the probability (density) of each point in decision
space for every possible waveform v. In a signal detection problem,
decision space contains only two points: signal present and
signal absent.

Figure 1 shows the general decision problem in terms of the
various spaces previously defined. A decision rule may be interpreted as
an operation that maps points in observation space into points in decision
space with a preassigned probability $D(d \mid v)$. The essence of the decision
problem is to choose decision rules that accomplish this mapping in an
optimum way with respect to a chosen criterion of performance. The
mathematical operations embodied in the decision rule define the
operations performed by an "optimum" decision receiver on the received
waveform v in order to render a decision d in accordance with the
selected criterion.

2.2 Parameter Estimation

Some attributes of a radar target can be deduced from
modifications of the reflected radar waveform. These modifications are
conveniently characterized by unknown signal parameters of an otherwise
deterministic  echo  signal  structure.  Theoretically,  were  it  not  for  the 
presence  of  noise,  the  values  of  these  parameters  could  be  measured 
to  any  desired  degree  of  precision. 

Parameter estimation is formulated as a problem in statistical
decision theory by an extension of the concept of radar detection. In
detection, observation space $\Gamma$ is mapped into two points in decision
space $\Delta$ by means of decision rule $D(d \mid v)$, namely signal present and
signal absent. Parameter estimation results when decision space $\Delta$ is
enlarged to include a selected subset of points in signal space $\Omega$.
Figure 1 shows the parameter estimation problem in terms of decision
theory. In fact, the set of points in decision space may contain the
entire set of points in signal space.

In  this  case  the  dimensionalities  of  signal  space  and  decision  space 
are  identical.  Often,  however,  less  precision  is  required,  in  which 
case  the  dimensionality  of  decision  space  is  smaller  than  that  of  signal 
space. Figure 2 illustrates two possible situations - one in which the
dimensionalities of signal and decision space are the same, and the
other, shown by dashed lines, in which the dimensionality of decision
space is less than that of signal space. A similar situation exists when
signal  space  is  of  infinite  dimension. 

In summary, parameter estimation divides observation space
into subsets of points that are mapped by a decision rule into signal
points in decision space $\Delta$. Thus, decision $d_i$ is assigned to observed
waveform v in accordance with decision rule $D(d_i \mid v)$, when v is a
member of the i-th subset of points in $\Gamma$. As before, the optimum
decision rule is determined by the selected optimality criterion.

Since in both detection and parameter estimation the decision rule
maps the space of observations $\Gamma$ into the space of decisions $\Delta$, the
simple detection problem is seen to be merely a special case of the
parameter estimation problem where all the points in decision space
corresponding to signal present ($s \ne 0$) are grouped together. It
should  be  noted,  however,  that  a  decision  receiver  that  is  optimum  for 
parameter  estimation  may  not  be  optimum  for  detection.  Thus,  it  is 
necessary  to  treat  detection  and  parameter  estimation  as  separate 
statistical  decision  problems. 

2.3 Loss Functions

In  order  to  select  an  optimum  decision  rule  in  a  statistical  decision 
problem,  we  evaluate  the  relative  performance  of  each  possible  decision 
rule,  selecting  the  rule  that  yields  the  "best"  performance.  This  means 
that a method of evaluating performance is required. The concept of a
simple 'cost' or 'loss function', which associates a quantitative cost
$C(s, d)$ with each point s in signal space $\Omega$ and each point d in decision
space $\Delta$, was introduced by Wald (2). The cost function describes the
loss incurred by a receiving system that results in a decision d when
the input signal is s. In the case of a correct decision, the loss in the
cost function may be interpreted as a gain.

A  substantial  theory  has  been  developed  for  problems  in  which 
'average  loss'  is  used  as  a  measure  of  comparative  system  performance. 
This  choice  is  motivated  by  the  fact  that  average  loss  is  representative  of 
system  performance  evaluated  over  all  possible  modes  of  behavior.  A 
decision  rule  that  describes  a  receiver  with  the  least  average  loss  is 
called a 'Bayes rule', and the receiver a 'Bayes receiver'. Other
performance criteria lead to different decision rules (e.g., minimax,
Neyman-Pearson).

It is convenient to define two loss functions. The 'conditional loss'
$L_c(D \mid s)$ is a useful measure of loss when the input signal is known, or
when the input signal is not known and the a priori probability density
$\sigma(s)$ over signal space $\Omega$ is also unknown. If the a priori probability
density $\sigma(s)$ is known, a more complete performance loss rating is
provided by the 'average loss' $L(D, \sigma)$.

The conditional loss $L_c(D \mid s)$ is defined as the mathematical
expectation of the loss with respect to all possible decisions d for a given
s and decision rule D. Thus,

(1a)  $L_c(D \mid s) = E_{d \mid s}[C(s, d)]$

(1b)  $= \int_{\Delta} C(s, d)\, p(d \mid s)\, \mathrm{d}d.$
Equation (1) states that the conditional loss is the sum
of costs associated with all possible decisions weighted
by their probability of occurrence, assuming that s is the
true state of nature. The conditional probability of
deciding d given s, $p(d \mid s)$, can also be expressed by:

(2)  $p(d \mid s) = \int_{\Gamma} p(d, v \mid s)\, \mathrm{d}v.$

The form of equation (2) indicates that $p(d \mid s)$ can be
considered a (conditional) marginal density function that
can be derived from the (conditional) joint probability
density function $p(d, v \mid s)$. By means of the chain rule
for conditional probabilities:

(3)  $p(d, v \mid s) = D(d \mid v, s)\, p(v \mid s).$

Thus, equation (2) can be expressed as:

(4a)  $p(d \mid s) = \int_{\Gamma} D(d \mid v, s)\, p(v \mid s)\, \mathrm{d}v$

(4b)  $= \int_{\Gamma} D(d \mid v)\, p(v \mid s)\, \mathrm{d}v.$

Equation (4b) used the fact:

(5)  $D(d \mid v, s) = D(d \mid v),$

since the decision rule $D(d \mid v)$ is only a function of the
waveform v, as previously discussed, and is therefore
independent of s. Inserting (4b) into (1) results in:

(6)  $L_c(D \mid s) = \int_{\Gamma} p(v \mid s)\, \mathrm{d}v \int_{\Delta} C(s, d)\, D(d \mid v)\, \mathrm{d}d.$
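
As a numerical check on the step from (1b) to (6), the following sketch
replaces the integrals over observation space and decision space by finite
sums. The two signal points, three observation points, the probabilities,
and the costs are all assumed values chosen for illustration; none of them
come from the report.

```python
# Illustrative discrete setting (all numbers are assumptions).
signals = ["s0", "s1"]
observations = ["v0", "v1", "v2"]
decisions = ["d0", "d1"]

# p(v|s): each row sums to 1.
p_v_given_s = {
    "s0": {"v0": 0.7, "v1": 0.2, "v2": 0.1},
    "s1": {"v0": 0.1, "v1": 0.3, "v2": 0.6},
}
# Decision rule D(d|v): depends on v only; each row sums to 1.
D = {
    "v0": {"d0": 1.0, "d1": 0.0},
    "v1": {"d0": 0.5, "d1": 0.5},  # randomized on the ambiguous observation
    "v2": {"d0": 0.0, "d1": 1.0},
}
# Cost C(s, d): zero for correct decisions, positive for errors.
C = {("s0", "d0"): 0.0, ("s0", "d1"): 1.0,
     ("s1", "d0"): 2.0, ("s1", "d1"): 0.0}

def p_d_given_s(d, s):
    """Equation (4b) with the integral over observation space as a sum."""
    return sum(D[v][d] * p_v_given_s[s][v] for v in observations)

def conditional_loss_1b(s):
    """Equation (1b): sum over decisions of C(s, d) p(d|s)."""
    return sum(C[(s, d)] * p_d_given_s(d, s) for d in decisions)

def conditional_loss_6(s):
    """Equation (6): sum over v of p(v|s) times sum over d of C(s, d) D(d|v)."""
    return sum(
        p_v_given_s[s][v] * sum(C[(s, d)] * D[v][d] for d in decisions)
        for v in observations
    )

for s in signals:
    print(s, conditional_loss_1b(s), conditional_loss_6(s))
```

Both routes give the same conditional loss for each signal point, as the
derivation requires, because the order of the two sums can be exchanged.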

When the input signal is not known, but the a priori
probability density function $\sigma(s)$ is known, the average
loss $L(D, \sigma)$ is defined as the mathematical expectation
of the conditional loss with respect to the input signal
statistics. Thus:

(7a)  $L(D, \sigma) = E_{s}[L_c(D \mid s)]$

(7b)  $= \int_{\Omega} \sigma(s)\, \mathrm{d}s \int_{\Gamma} p(v \mid s)\, \mathrm{d}v \int_{\Delta} C(s, d)\, D(d \mid v)\, \mathrm{d}d.$


Alternatively,  the  average  loss  can  be  defined  as  the  sum  of  costs 
associated  with  decisions  d  and  inputs  s  weighted  according  to  their  joint 
probability  of  occurrence.  Thus, 


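(8a)  $L(D, \sigma) = E_{d, s}[C(s, d)]$

(8b)  $= \int_{\Omega} \int_{\Delta} C(s, d)\, p(d, s)\, \mathrm{d}d\, \mathrm{d}s$

(8c)  $= \int_{\Omega} \sigma(s)\, \mathrm{d}s \int_{\Delta} C(s, d)\, p(d \mid s)\, \mathrm{d}d.$

The inner integral in (8c) is the conditional loss defined in (1b).
Therefore, $L(D, \sigma)$ can also be written as:

(9)  $L(D, \sigma) = \int_{\Omega} L_c(D \mid s)\, \sigma(s)\, \mathrm{d}s,$

which is a restatement of (7a).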


In summary, the average loss function $L(D, \sigma)$ provides a measure
for evaluating the performance of different systems when complete a
priori statistics concerning the signal and noise are available. We will
next  examine  the  binary  detection  problem. 
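
A small sketch of how the average loss ranks competing systems is given
here. It works in a discrete setting in which each system is summarized by
its decision probabilities p(d|s); the prior, the costs, and the two
hypothetical systems `rule_a` and `rule_b` are assumed values for
illustration only. The loss is computed both from (9) and from the joint
form (8b) to show that they agree.

```python
# Assumed discrete example: prior sigma(s), costs C(s, d), and the decision
# probabilities p(d|s) produced by two hypothetical systems.
sigma = {"s0": 0.8, "s1": 0.2}
C = {("s0", "d0"): 0.0, ("s0", "d1"): 1.0,
     ("s1", "d0"): 2.0, ("s1", "d1"): 0.0}
decisions = ["d0", "d1"]

rule_a = {"s0": {"d0": 0.80, "d1": 0.20}, "s1": {"d0": 0.25, "d1": 0.75}}
rule_b = {"s0": {"d0": 0.95, "d1": 0.05}, "s1": {"d0": 0.50, "d1": 0.50}}

def average_loss_9(p_d_given_s):
    """Equation (9): prior-weighted sum of the conditional losses (1b)."""
    conditional = {
        s: sum(C[(s, d)] * p_d_given_s[s][d] for d in decisions)
        for s in sigma
    }
    return sum(sigma[s] * conditional[s] for s in sigma)

def average_loss_8b(p_d_given_s):
    """Equation (8b): costs weighted by the joint probability p(d, s)."""
    return sum(
        C[(s, d)] * p_d_given_s[s][d] * sigma[s]
        for s in sigma for d in decisions
    )

for name, rule in (("A", rule_a), ("B", rule_b)):
    print(name, average_loss_9(rule), average_loss_8b(rule))
```

With these assumed numbers the second system has the lower average loss even
though it misses the signal more often, because the signal is rare under the
assumed prior.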


2.4 Binary Detection

Binary detection involves making a decision between two possible
outcomes: noise alone or signal plus noise.


Let $H_0$ denote the (null) hypothesis that noise alone is present,
and $H_1$ the composite alternative hypothesis that signal plus noise is
present. Thus:

(10)  $H_0: s \in \Omega_0, \qquad H_1: s \in \Omega_1,$

where $\Omega_0$ and $\Omega_1$ are nonoverlapping regions of signal space. It
therefore follows from (10) that $\Omega_0$ contains the single point s = 0,
and $\Omega_1$ contains all points $s \ne 0$.

We can find an expression for the a priori probability density
$\sigma(s)$ defined over signal space as follows. Let P and Q be the a priori
probabilities of signal present and signal absent, respectively. Then:

(11)  $\sigma(s) = Q\, \delta(s - 0) + P\, u(s),$

where the Dirac delta function $\delta(s - 0)$ describes the discrete
probability distribution of s over $\Omega_0$ and u(s) describes the probability
density of s over space $\Omega_1$. We see that:

(12)  $\int_{\Omega_1} u(s)\, \mathrm{d}s = 1.$

When (11) is substituted into (7b), the expression for average loss
$L(D, \sigma)$ may be rewritten as:


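(13)  $L(D, \sigma) = Q \int_{\Gamma} p(v \mid 0)\, \mathrm{d}v \int_{\Delta} C(0, d)\, D(d \mid v)\, \mathrm{d}d + P \int_{\Omega_1} u(s)\, \mathrm{d}s \int_{\Gamma} p(v \mid s)\, \mathrm{d}v \int_{\Delta} C(s, d)\, D(d \mid v)\, \mathrm{d}d.$

Equation (13) can be simplified with the definition:

(14)  $E_{s}[p(v \mid s)] = \overline{p(v \mid s)}^{\,s} = \int_{\Omega_1} u(s)\, p(v \mid s)\, \mathrm{d}s,$

to give:

(15)  $L(D, \sigma) = Q \int_{\Gamma} p(v \mid 0)\, \mathrm{d}v \int_{\Delta} C(0, d)\, D(d \mid v)\, \mathrm{d}d + P \int_{\Gamma} \overline{p(v \mid s)}^{\,s}\, \mathrm{d}v \int_{\Delta} C(s, d)\, D(d \mid v)\, \mathrm{d}d.$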


Let cost assignments C(s, d) and C(0, d) be made as shown in Table 1,
where $C_\alpha$ and $C_{\bar\beta}$ denote costs of errors. $C_\alpha$ is the penalty or cost
associated with deciding signal is present when, in fact, there is no
signal. $C_{\bar\beta}$ is the cost associated with deciding no signal when, in
fact, there is a signal present. The notation reflects the fact that $\alpha$
is the false-alarm probability, and $\bar\beta$ is the average missed-detection
probability. The quantities $C_{1-\alpha}$ and $C_{1-\bar\beta}$ represent the costs of
correct decisions - that is:

(16a)  $C_{1-\alpha} = C(s \in \Omega_0, d_0)$

(16b)  $C_{1-\bar\beta} = C(s \in \Omega_1, d_1).$

These costs can be carried through the remaining derivations. However,
since no penalty is usually associated with correct decisions, it is
convenient to set the cost of correct decisions to 0:

(17)  $C_{1-\alpha} = C_{1-\bar\beta} = 0.$

Table 1 - Cost Matrix for Binary Detection

                      Signal s
Decision d      s = 0              s ≠ 0
$d_0$           $C_{1-\alpha}$     $C_{\bar\beta}$
$d_1$           $C_\alpha$         $C_{1-\bar\beta}$


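Substituting the cost matrix of Table 1 and equation (17) into (15)
gives:

(18)  $L(D, \sigma) = Q\, C_\alpha \int_{\Gamma} D(d_1 \mid v)\, p(v \mid 0)\, \mathrm{d}v + P\, C_{\bar\beta} \int_{\Gamma} D(d_0 \mid v)\, \overline{p(v \mid s)}^{\,s}\, \mathrm{d}v.$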

Equation (18) can be written in another form. Let $\alpha$ denote the
probability of deciding a signal is present when there is no signal (Type I
error or false alarm), and $\bar\beta$ the probability of deciding that signal is
absent when it is really present (Type II error or missed detection):

(19)  $\alpha = \int_{\Gamma} p(v \mid 0)\, D(d_1 \mid v)\, \mathrm{d}v$

(20)  $\bar\beta = \int_{\Gamma} \overline{p(v \mid s)}^{\,s}\, D(d_0 \mid v)\, \mathrm{d}v = \overline{\int_{\Gamma} p(v \mid s)\, D(d_0 \mid v)\, \mathrm{d}v}^{\,s} = \overline{\beta(s)}^{\,s}.$

Note that $\bar\beta$ by definition is the Type II error probability averaged
with respect to the a priori distribution of signal. Substituting (19)
and (20) into (18) gives:

(21)  $L(D, \sigma) = Q\, \alpha\, C_\alpha + P\, \bar\beta\, C_{\bar\beta}.$

Equation (21) relates average loss $L(D, \sigma)$ to the a priori probability
of signal P = 1 - Q, the probabilities of Type I and Type II errors,
$\alpha$ and $\bar\beta$, and the costs of Type I and Type II errors, $C_\alpha$ and
$C_{\bar\beta}$, respectively.
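
To show how the quantities in (19) to (21) fit together, here is a minimal
sketch under stated assumptions: a single scalar observation, zero-mean
Gaussian noise of known standard deviation, a nonrandom rule that decides
signal present when v exceeds a threshold `v0`, and a discrete prior u(s)
over two signal amplitudes. None of these modeling choices or numbers come
from the report.

```python
import math

def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def false_alarm(v0, noise_std=1.0):
    """Equation (19) for the assumed threshold rule: P(v >= v0 | s = 0)."""
    return 1.0 - normal_cdf(v0 / noise_std)

def miss(v0, s, noise_std=1.0):
    """beta(s): P(v < v0 | s) for one signal amplitude s."""
    return normal_cdf((v0 - s) / noise_std)

def average_miss(v0, u, noise_std=1.0):
    """Equation (20): beta(s) averaged over the assumed discrete prior u(s)."""
    return sum(weight * miss(v0, s, noise_std) for s, weight in u.items())

def average_loss(v0, P, C_alpha, C_beta, u, noise_std=1.0):
    """Equation (21): Q * alpha * C_alpha + P * beta_bar * C_beta."""
    Q = 1.0 - P
    return (Q * false_alarm(v0, noise_std) * C_alpha
            + P * average_miss(v0, u, noise_std) * C_beta)

# Hypothetical prior over amplitudes: 1.0 and 3.0 are equally likely.
u = {1.0: 0.5, 3.0: 0.5}
for v0 in (1.0, 2.0, 3.0):
    print(v0, false_alarm(v0), average_miss(v0, u),
          average_loss(v0, P=0.2, C_alpha=1.0, C_beta=2.0, u=u))
```

Raising the threshold lowers the false-alarm probability and raises the
averaged miss probability; (21) combines the two into a single figure of
merit once the priors and costs are fixed.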

2.5 Bayes' Decision Rule

Bayes' decision rule $D_B$ results from the minimization of $L(D, \sigma)$.
Since binary decision space $\Delta$ contains only the two points $d_0$ (no signal)
and $d_1$ (signal plus noise), decision rule $D_B(d \mid v)$ satisfies the relation:

(22)  $D_B(d_0 \mid v) + D_B(d_1 \mid v) = 1.$

Substituting (22) into (18) and eliminating $D_B(d_1 \mid v)$ yields:

(23)  $L(D, \sigma) = Q\, C_\alpha + \int_{\Gamma} D_B(d_0 \mid v) \left[ P\, C_{\bar\beta}\, \overline{p(v \mid s)}^{\,s} - Q\, C_\alpha\, p(v \mid 0) \right] \mathrm{d}v.$

Note that $D_B(d_0 \mid v)$ is nonnegative and no greater than unity. Also P, Q,
$C_\alpha$, and $C_{\bar\beta}$ are positive quantities. Then, to minimize $L(D, \sigma)$
choose:

(24a)  $D_B(d_0 \mid v) = 1$

(24b)  $D_B(d_1 \mid v) = 0,$

that is, decide signal is absent when

(25)  $P\, C_{\bar\beta}\, \overline{p(v \mid s)}^{\,s} < Q\, C_\alpha\, p(v \mid 0),$

and choose

(26a)  $D_B(d_0 \mid v) = 0$

(26b)  $D_B(d_1 \mid v) = 1,$

that is, decide signal is present when:

(27)  $P\, C_{\bar\beta}\, \overline{p(v \mid s)}^{\,s} \ge Q\, C_\alpha\, p(v \mid 0).$


Inequalities (25) and (27) can be rewritten in terms of a function
$\ell(v)$, called the 'generalized likelihood ratio':

(28)  $\ell(v) = \dfrac{P\, \overline{p(v \mid s)}^{\,s}}{Q\, p(v \mid 0)}.$

With this definition, the Bayes decision rule reduces to:

(29a)  Decide $d_1$ when $\ell(v) \ge T$ (signal present)

(29b)  Decide $d_0$ when $\ell(v) < T$ (signal absent),

where:

(30)  $T = \dfrac{C_\alpha}{C_{\bar\beta}}.$

Equation (29) specifies a test strategy in terms of likelihood ratio
$\ell(v)$, which is a function of data v, and threshold T, which is a
function of error cost assignments. The Bayes decision rule
divides observation space $\Gamma$ into two regions $\Gamma'$ and $\Gamma''$, which are
separated by the boundary $\ell(v) = T$. The acceptance region $\Gamma'$ for
hypothesis $H_0$ ($s \in \Omega_0$) contains all v for which $\ell(v) < T$. The
rejection region $\Gamma''$ for hypothesis $H_0$ contains all v for which
$\ell(v) \ge T$. The rejection region for hypothesis $H_0$ is, of course, the
acceptance region for hypothesis $H_1$ ($s \in \Omega_1$).

When P, Q, u(s), $p(v \mid s)$, and cost assignments $C_\alpha$ and $C_{\bar\beta}$ are
known, the Bayes strategy requires that the generalized likelihood ratio
be computed for received data v and the result compared with a
threshold T, defined by (30). In general, the computation of the likelihood
ratio is a complex nonlinear operation on data v. In radar,
approximations for the important cases of threshold signals and very large
signals permit physical interpretation of receiver structure.
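
The test of (28) to (30) can be sketched for one simple case. The sketch
assumes a single scalar observation in zero-mean Gaussian noise and a
discrete prior u(s) over a few signal amplitudes, so that the average in
(14) becomes a weighted sum. The function names and all numerical values
are hypothetical.

```python
import math

def gaussian_pdf(v, mean, std=1.0):
    """Gaussian density, used here as the assumed form of p(v|s)."""
    z = (v - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def generalized_likelihood_ratio(v, P, u, std=1.0):
    """Equation (28), with the average (14) taken over a discrete prior u(s)."""
    Q = 1.0 - P
    averaged = sum(weight * gaussian_pdf(v, s, std) for s, weight in u.items())
    return P * averaged / (Q * gaussian_pdf(v, 0.0, std))

def bayes_decision(v, P, u, C_alpha, C_beta, std=1.0):
    """Equations (29) and (30): compare the ratio with T = C_alpha / C_beta."""
    T = C_alpha / C_beta
    return "d1" if generalized_likelihood_ratio(v, P, u, std) >= T else "d0"

# Hypothetical prior: amplitudes 1.0 and 2.0 with equal weight.
u = {1.0: 0.5, 2.0: 0.5}
for v in (-1.0, 0.5, 1.5, 3.0):
    print(v, generalized_likelihood_ratio(v, 0.5, u),
          bayes_decision(v, 0.5, u, C_alpha=1.0, C_beta=1.0))
```

Raising the false-alarm cost `C_alpha` relative to `C_beta` raises the
threshold T, so a larger observation is needed before signal present is
declared.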
2.6 Error Probabilities

Expressions for type I and type II error probabilities $\alpha$ and $\bar\beta$,
respectively, are given by equations (19) and (20). These expressions
apply, in general, to both Bayes and non-Bayes decision rules and are
not restricted to optimum systems. In statistical terminology $\alpha$,
the probability of rejecting $H_0$ when, in fact, it is true, is called the
'level' or 'size' of the test; $1 - \bar\beta$, the probability of rejecting $H_0$
when, in fact, it is false, is called the 'power' of the test. In radar,
$1 - \bar\beta$ is the probability of target detection.

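Since observation space $\Gamma$ consists of nonoverlapping regions $\Gamma'$
and $\Gamma''$, we can rewrite equations (19) and (20) for a Bayes decision
rule receiver as:

(31)  $\alpha = \int_{\Gamma'} p(v \mid 0)\, D_B(d_1 \mid v)\, \mathrm{d}v + \int_{\Gamma''} p(v \mid 0)\, D_B(d_1 \mid v)\, \mathrm{d}v$

(32)  $\bar\beta = \int_{\Gamma'} \overline{p(v \mid s)}^{\,s}\, D_B(d_0 \mid v)\, \mathrm{d}v + \int_{\Gamma''} \overline{p(v \mid s)}^{\,s}\, D_B(d_0 \mid v)\, \mathrm{d}v.$

Note from earlier remarks that:

(33)  $D_B(d_1 \mid v) = 0, \quad D_B(d_0 \mid v) = 1$  for v in $\Gamma'$, and

(34)  $D_B(d_0 \mid v) = 0, \quad D_B(d_1 \mid v) = 1$  for v in $\Gamma''$,

so that equations (31) and (32) simplify to:

(35)  $\alpha = \int_{\Gamma''} p(v \mid 0)\, \mathrm{d}v$

(36)  $\bar\beta = \int_{\Gamma'} \overline{p(v \mid s)}^{\,s}\, \mathrm{d}v.$

To illustrate, consider a simple example in which signal space contains
a single known signal s, and a single observation v is made. In this
case, observation space $\Gamma$ may be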
represented by the real line $-\infty < v < \infty$. Probability densities
$p(v \mid 0)$ and $p(v \mid s)$ are graphed with real line v as abscissa, as shown
in Figure 3.

Partitioning of observation space into two parts is equivalent to
partitioning the real line $-\infty < v < \infty$ by a point $v_0$, which is obtained
by solving $\ell(v) = T$ for $v = v_0$. It follows that $\alpha$ and $\bar\beta$ are given
by:

(37)  $\alpha = \int_{v_0}^{\infty} p(v \mid 0)\, \mathrm{d}v$

(38)  $\bar\beta = \beta = \int_{-\infty}^{v_0} p(v \mid s)\, \mathrm{d}v.$

Equations (37) and (38) state that the type I error or false-alarm
probability is the area under the probability density curve $p(v \mid 0)$ over the
interval in v for which signal present is decided. The type II error or
false-dismissal probability is the area under probability curve $p(v \mid s)$
over the interval in v for which signal absent is decided.

It can also be seen in Figure 3 that if we move the threshold $v_0$ to
the left we can eliminate the cross-hatched area and reduce the
probability of error. In general, if
$P\, \overline{p(v \mid s)}^{\,s}\, C_{\bar\beta} < Q\, p(v \mid 0)\, C_\alpha$, it is
advantageous to have v be in $\Gamma'$ so that the smaller quantity will
contribute to the integral (36). This is exactly what the Bayes decision rule
achieves. If $C_\alpha = C_{\bar\beta}$ and $C_{1-\alpha} = C_{1-\bar\beta} = 0$*, the Bayes
classifier possesses the property that the optimal decision minimizes the
probability of error in classification.

*  If $C_{\bar\beta} = C_\alpha$ and $C_{1-\alpha} = C_{1-\bar\beta} = 0$, this is called a
'symmetrical' or 'zero-one' loss function.
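
The single-signal example can be worked numerically. The sketch below
assumes Gaussian noise of standard deviation `std` and a known positive
amplitude s, for which solving $\ell(v) = T$ gives the closed form
$v_0 = s/2 + (\mathrm{std}^2 / s) \ln(T Q / P)$. This closed form and all
numbers are assumptions tied to the Gaussian model, not results quoted from
the report.

```python
import math

def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def threshold_v0(s, P, T, std=1.0):
    """Solve l(v) = T for v under the assumed Gaussian model (s > 0)."""
    Q = 1.0 - P
    return s / 2.0 + (std ** 2 / s) * math.log(T * Q / P)

def error_probs(v0, s, std=1.0):
    """Equations (37) and (38): tail areas on either side of v0."""
    alpha = 1.0 - normal_cdf(v0 / std)
    beta = normal_cdf((v0 - s) / std)
    return alpha, beta

def prob_error(v0, s, P, std=1.0):
    """Total error probability Q * alpha + P * beta (zero-one loss)."""
    alpha, beta = error_probs(v0, s, std)
    return (1.0 - P) * alpha + P * beta

s, P = 2.0, 0.3
v0 = threshold_v0(s, P, T=1.0)  # T = 1 corresponds to the zero-one loss
print("v0 =", v0, "alpha, beta =", error_probs(v0, s))

# Compare against a coarse grid of alternative thresholds.
grid = [i * 0.01 for i in range(-300, 501)]
best = min(grid, key=lambda t: prob_error(t, s, P))
print("grid minimizer =", best, "Bayes threshold =", v0)
```

With equal error costs the grid search lands next to the Bayes threshold,
which is the minimum-error property stated for the zero-one loss.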


2.7 The Neyman-Pearson Criterion


The Neyman-Pearson theory of hypothesis testing antedates the
development of statistical decision theory. It does not require
knowledge of a priori signal statistics, nor does it require an explicit
assignment of cost functions. An optimum test is defined as one that
minimizes the probability of certain errors. In a test of hypothesis H,
two types of errors can be made: H may be rejected when it is true,
or it may be accepted when it is false. An optimum test is one which
minimizes the probability of committing both types of errors - that is,
the test should have a small probability of rejecting H when it is true
and a large probability of rejecting H when it is false. A test with
probability $\epsilon$ of rejecting H when it is true is called a 'test of
level $\epsilon$'. The Neyman-Pearson criterion asserts that among all tests of
level $\epsilon$, the 'best' test is the one which has the greatest probability
of rejecting H when it is false.

When applied to radar, the Neyman-Pearson test is a test between
two alternative hypotheses, $H_0$ and $H_1$, only one of which is true. The
Neyman-Pearson criterion requires that, for a fixed false-alarm
probability $\alpha$, a test be found that minimizes the missed target-detection
probability $\beta$ or, equivalently, maximizes the probability of target
detection ($1 - \beta$).

In general, hypothesis $H_1$ can be a simple or composite hypothesis.
In the classical Neyman-Pearson test, hypothesis $H_1$ is assumed to be
simple - that is, the signal consists of a single known value s = a.
The simple alternative hypothesis does not apply to radar since the
target echo is generally a function of many variables. When signal
space consists of more than one element, $H_1$ is a composite hypothesis.
In this case, the probability of a type II error is a function of the
signal parameters. For this situation, the classical Neyman-Pearson
strategy must be modified.


One  extension  of  the  Neyman- Pearson  test  strategy,  when  is 
composite  hypothesis,  is  to  minimize  the  total  type  II  error  probability 
that  has  been  averaged  with  respect  to  the  a  priori  probability  density 
of  signal.  This  requires  a  priori  statistics.  Thus,  we  minimize  P  — 

M -  P 

subject  to  a  total  fixed  type  I  error  probability  Q  .  This  extension 
is  referred  to  as  the  'modified'  Neyman- Pearson  criterion.  As  before, 
P  is  the  a  priori  probability  of  signal  present.  Q  =  1  -  P  is  the  a  priori 
probability  density  of  signal  absent,  and  6  is  given  by  equation  (20). 
Following  the  method  of  Lagrangian  multipliers,  the  best  decision  rule 
D^p,  in  the  modified  Neyman- Pearson  sense,  minimizes: 

(39)  L-Np  :P  M  X  Q  a  ,  where  X  is  the  Lagrange  multiplier 
that  is  undetermined  at  this  point.  Note,  equation  (39)  is  the 
same  as  equation  (21)  with  C  a  =  X  and  C  j  =1.  Substituting 
equations  (19)  and  (20)  into  (39)  gives: 


(40)  L  =  P 
'  1  NP 


p(v  |  s  D  (d  |  v)  dv  +  Xq 

S  ^ 


j 

With  equation  (22),  (40)  becomes: 


p(v  I  o)  D(d  I  v)dv. 

r  1 


(41)  L 


NP 


D(d  |  v)  [P  p(v  |  a)-  -  x  Q  p(v  jo)]  dv  +  A  Q. 


This expression is minimized by choosing

(42)  $D_{NP}(d_0 \mid v) = 1$, $D_{NP}(d_1 \mid v) = 0$

(that is, deciding no signal) when

(43)  $\ell(v) = \dfrac{P \langle p(v \mid s) \rangle_s}{Q\, p(v \mid 0)} < \lambda$,

and choosing

(44)  $D_{NP}(d_0 \mid v) = 0$, $D_{NP}(d_1 \mid v) = 1$

(signal present) when

(45)  $\ell(v) \geq \lambda$.

Comparing this rule with that of equation (29), we see that the modified Neyman-Pearson strategy is identical to a Bayes test strategy with threshold $T = \lambda$. The choice of $\lambda$ is not arbitrary but depends on the specification of $\alpha$, since its value

(46)  $\alpha = \int_{\Gamma} p(v \mid 0)\, D_{NP}(d_1 \mid v)\, dv$

is determined by the surface $\ell(v) = \lambda$ separating the regions $\Gamma'$ and $\Gamma''$ in observation space. This strategy is often employed in radar problems.
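The link between the specified level and the multiplier can be sketched numerically. The example below is only an illustration under stated assumptions: a single observation, a known positive mean A in zero-mean Gaussian noise of standard deviation sigma (so the averaged density reduces to p(v | A)), and hypothetical function names that do not come from the report. Because the likelihood ratio is then monotonic in v, fixing the type I error probability fixes an observation threshold, and the multiplier is the likelihood ratio evaluated at that threshold.

```python
import math
from statistics import NormalDist


def threshold_for_level(alpha, sigma):
    """Observation threshold v0 with P(v >= v0 | no signal) = alpha."""
    return sigma * NormalDist().inv_cdf(1.0 - alpha)


def likelihood_ratio(v, A, sigma, P):
    """l(v) = P p(v|A) / (Q p(v|0)) for Gaussian noise (assumed signal model)."""
    Q = 1.0 - P
    return (P / Q) * math.exp((A * v - 0.5 * A * A) / sigma ** 2)


def modified_np_design(alpha, A, sigma, P):
    """Return (v0, lambda, probability of detection) for a fixed level alpha."""
    v0 = threshold_for_level(alpha, sigma)
    lam = likelihood_ratio(v0, A, sigma, P)  # multiplier implied by alpha
    p_detect = 1.0 - NormalDist(mu=A, sigma=sigma).cdf(v0)
    return v0, lam, p_detect


# Illustrative values only (assumptions, not taken from the report).
v0, lam, p_detect = modified_np_design(alpha=1e-3, A=4.0, sigma=1.0, P=0.1)
```

Deciding "signal present" when `likelihood_ratio(v, ...) >= lam` is then the same test as deciding "signal present" when `v >= v0`.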

2.8 The Minimax Approach

To apply Bayes' criterion for minimizing average loss, it is necessary to know the statistics of the noise process, as well as $p(v \mid s)$ and the a priori signal statistics $\sigma(s)$. In many practical cases, the probability density function $\sigma(s)$ is not known and it is not feasible to obtain experimental data to establish $\sigma(s)$. As a result, Bayes' criterion cannot be applied. Another criterion that may be reasonable is the 'minimax criterion'.


As an example, consider a situation in which signal space $\Omega$ contains four points, denoted by $S_1$, $S_2$, $S_3$, and $S_4$. It follows from (lb) that there is a conditional loss $L_c(D \mid S_i)$, $i = 1, 2, 3, 4$, associated with each member of signal space. The values of conditional loss depend on the decision rule $D$. Figure 4 shows three sets of conditional losses corresponding to three different decision rules $D_1$, $D_2$, and $D_3$. The maximum value of conditional loss is circled for each rule. Note that one of these decision rules results in a peak, or maximum, conditional loss that is less than the maximum losses resulting from the other two. A decision rule that minimizes the maximum conditional loss is called a 'minimax' rule. If the set of admissible decision rules contains only $D_1$, $D_2$, and $D_3$, then that rule is the minimax rule.

A minimax decision rule $D_M$ results in a maximum conditional loss equal to, or less than, that resulting from any other admissible decision rule $D$:

(47)  $\max_s L_c(D_M \mid s) \leq \max_s L_c(D \mid s)$ for all $D$, or

(48)  $\max_s L_c(D_M \mid s) = \min_D \max_s L_c(D \mid s)$.

For very general conditions, which are almost always met in practice, Wald (2) has shown that:

(49)  $\max_s \min_D L_c(D \mid s) = \min_D \max_s L_c(D \mid s)$,

from which the origin of the name minimax is apparent. It can be shown that a minimax decision rule $D_M$ is a Bayes rule relative to a least-favorable a priori distribution $\sigma_M(s)$. Also, the Bayes average loss $L(D_M, \sigma_M)$, corresponding to $D_M$ and $\sigma_M(s)$, is at least as large as the Bayes average loss corresponding to any other a priori signal distribution, i.e.

(50)  $L(D_M, \sigma_M) \geq L_B(D_B, \sigma)$ for all $\sigma(s)$,

where $L_B$ is the Bayes loss resulting from Bayes rule $D_B$ and a priori signal distribution $\sigma(s)$. Thus, the minimax loss is the largest Bayes loss when all a priori distributions $\sigma(s)$ are considered.

For example, consider the case where a test was obtained for the presence of a positive mean $A$ in Gaussian noise with variance $\sigma^2$. When only one observation $v$ is available, the boundary between decision regions in observation space reduces to:

(51)  $v_0 = \dfrac{A}{2} + \dfrac{\sigma^2}{A} \ln \dfrac{Q C_\alpha}{P C_\beta}$.

For known $A$, $\sigma$, and specified $C_\alpha$ and $C_\beta$, $v_0$ is a function only of $P$ (since $Q = 1 - P$). The error probabilities can be expressed as:

(52)  $\alpha = \int_{v_0(P)}^{\infty} p(v \mid 0)\, dv$

(53)  $\beta = \int_{-\infty}^{v_0(P)} p(v \mid A)\, dv$,

from which both $\alpha$ and $\beta$ can be found as functions of $P$. From equation (21), the Bayes average loss is given by:

(54)  $L_B(\sigma) = (1 - P)\, \alpha(P)\, C_\alpha + P\, \beta(P)\, C_\beta$.

This loss can be computed for various values of the a priori probability $P$ of signal present and plotted as shown in Figure 5. The maximum loss is obtained by differentiating equation (54) with respect to $P$ and setting the result equal to zero:

(55)  $\alpha(P)\, C_\alpha = \beta(P)\, C_\beta$.

Equation (55) can be solved for $P = P_M$, at which the maximum loss occurs. When $P = P_M$, the Bayes loss is equal to the minimax loss. This follows since the minimax solution corresponds to the Bayes strategy for the worst a priori signal statistics. The minimax criterion in effect compensates for ignorance of the true state of nature by assuming the worst state of nature.
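The search for the least favorable a priori probability in this Gaussian example can be sketched as follows. This is a minimal illustration, assuming the standard single-observation threshold $v_0 = A/2 + (\sigma^2/A)\ln[Q C_\alpha/(P C_\beta)]$ for the Bayes test; the function names and numerical values are hypothetical. The condition of equation (55) is solved by bisection, which works here because $\alpha(P) C_\alpha$ increases and $\beta(P) C_\beta$ decreases as $P$ grows.

```python
import math
from statistics import NormalDist


def bayes_quantities(P, A, sigma, C_alpha, C_beta):
    """Threshold, error probabilities and Bayes loss for a priori probability P."""
    Q = 1.0 - P
    v0 = A / 2.0 + (sigma ** 2 / A) * math.log(Q * C_alpha / (P * C_beta))
    alpha = 1.0 - NormalDist(mu=0.0, sigma=sigma).cdf(v0)  # false alarm
    beta = NormalDist(mu=A, sigma=sigma).cdf(v0)           # false dismissal
    loss = Q * alpha * C_alpha + P * beta * C_beta
    return v0, alpha, beta, loss


def least_favorable_prior(A, sigma, C_alpha, C_beta, tol=1e-10):
    """Solve alpha(P) C_alpha = beta(P) C_beta for P by bisection."""
    lo, hi = 1e-9, 1.0 - 1e-9
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        _, alpha, beta, _ = bayes_quantities(mid, A, sigma, C_alpha, C_beta)
        if beta * C_beta - alpha * C_alpha > 0.0:
            lo = mid  # loss is still increasing with P
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Illustrative values only (assumptions, not taken from the report).
P_M = least_favorable_prior(A=2.0, sigma=1.0, C_alpha=1.0, C_beta=1.0)
_, alpha_M, beta_M, minimax_loss = bayes_quantities(P_M, 2.0, 1.0, 1.0, 1.0)
```

With equal costs the problem is symmetric, so the least favorable value found this way should sit at one half; unequal costs shift it.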


To summarize, a Bayes decision rule takes into consideration all of the a priori statistics relating to both signal and noise. When signal statistics are unavailable, a minimax decision rule sometimes offers a reasonable alternative. A minimax rule is a Bayes rule relative to a least favorable distribution; the minimax average loss is the maximum of all Bayes losses.




2.9 Bayes Solutions for Complex Cost Functions

In binary detection, signal $s$ is a function of a number of parameters $\theta$. Such parameters often include amplitude ($\theta = A$), time delay ($\theta = \tau$), initial phase ($\theta = \phi$), etc. In radar, the signal parameters provide information about various target parameters such as range, range rate, acceleration, azimuth, elevation angle, angular rate and acceleration. The signal statistics are described by the a priori (existence) probability $P$ and the a priori probability density $w(\theta)$. The discussion in this section differs from that in section 2.4 in that the cost of a correct decision, $C_{1-\beta}$, is not chosen equal to zero. Instead, $C_{1-\beta}$ is assumed to be a function of the signal parameters $\theta$. In the cost matrix of Table 2, $C_{1-\beta}(\theta)$ is the cost of a correctly detected signal. Substituting this matrix and equation (11), with $\sigma(s)$ replaced by $w(\theta)$, into equation (13) yields the average loss function:


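(56)  $L(D, \sigma) = Q C_\alpha \int_\Gamma D(d_1 \mid v)\, p(v \mid 0)\, dv + P C_\beta \int_\theta w(\theta)\, d\theta \int_\Gamma D(d_0 \mid v)\, p[v \mid s(\theta)]\, dv + P \int_\theta w(\theta)\, d\theta \int_\Gamma C_{1-\beta}(\theta)\, D(d_1 \mid v)\, p[v \mid s(\theta)]\, dv.$

When $C_{1-\beta}(\theta)$ in (56) is set equal to a constant independent of parameters $\theta$, it can be shown that minimizing equation (56) leads again to a Bayes solution, similar to that discussed in section 2.5, in terms of the generalized likelihood ratio of equation (28). With equation (22), equation (56) reduces to:

(57)  $L(D, \sigma) = P C_\beta + \int_\Gamma D(d_1 \mid v) \left\{ P \int_\theta C_{1-\beta}(\theta)\, w(\theta)\, p[v \mid s(\theta)]\, d\theta - P C_\beta \langle p[v \mid s(\theta)] \rangle_\theta + Q C_\alpha\, p(v \mid 0) \right\} dv.$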

Minimizing (57) yields the Bayes decision rule $D_B$:

(58)  $D_B(d_1 \mid v) = 1$, $D_B(d_0 \mid v) = 0$  (signal present; see equations (26a) and (26b))

when:

(59)  $P \left\{ C_\beta \langle p[v \mid s(\theta)] \rangle_\theta - \int_\theta C_{1-\beta}(\theta)\, w(\theta)\, p[v \mid s(\theta)]\, d\theta \right\} \geq Q C_\alpha\, p(v \mid 0)$.

Otherwise, decide signal absent.

Table 2 - Cost Matrix for Complex Cost Functions

                      Signal s
    Decision      s = 0          s ≠ 0
    $d_0$         0              $C_\beta$
    $d_1$         $C_\alpha$     $C_{1-\beta}(\theta)$

Inequality (59) can be rewritten as:

(60)  $\dfrac{P \langle p[v \mid s(\theta)] \rangle_\theta}{Q\, p(v \mid 0)} - \dfrac{P \int_\theta C_{1-\beta}(\theta)\, w(\theta)\, p[v \mid s(\theta)]\, d\theta}{Q C_\beta\, p(v \mid 0)} \geq \dfrac{C_\alpha}{C_\beta}$.

The first term in (60) is the generalized likelihood ratio defined in (28). Equation (60) is similar to equations (29a) and (29b) with the addition of a second term depending on the cost assignment $C_{1-\beta}(\theta)$.


One example of this type of problem is for a signal parameter $\theta_j = \tau$, where $\tau$ is the expected time of signal arrival, approximated by a discrete set, under two different cost matrix assignments.

In the first case, $C_{1-\beta}(\theta)$ is set equal to zero. This yields a solution that depends on the generalized likelihood ratio and results in a Bayes receiver that averages the output of a matched filter with respect to the a priori probability density of $\tau$.

In the second case, the cost function $C_{1-\beta}(\theta)$ is chosen to be a step function with penalty $C_m$ when the detection occurs at an arrival time other than the true value. The Bayes strategy for this case is a threshold test for each of a set of discrete expected arrival times. The threshold is determined by both the cost assignments and the a priori probability density of expected arrival time $\tau$. This strategy corresponds to the use of separate range bin tests, which is intuitively reasonable.

2.10 Preferred Neyman-Pearson Strategy

In many situations, the previous approach cannot be used because the a priori statistics are lacking or a reasonable basis for choosing the cost penalties $C_{1-\beta}(\theta)$ is not available. An alternative is the preferred Neyman-Pearson strategy. This strategy is to find a decision surface which separates the acceptance and rejection regions (with respect to hypothesis $H_0$) such that the Type II error probability $\beta(\theta)$ is minimized for a fixed value of $\alpha$ (the level of the test); or equivalently, the probability of detection (the power of the test) is maximized. Since the Type II error probability $\beta(\theta)$ is in general a function of the signal parameters $\theta$, the solution differs for each set of parameters. In special cases, the test is the same for all admissible values of $\theta$. Such a test is called 'uniformly most powerful'. These tests do not often occur.

When a uniformly most powerful solution cannot be found, other criteria can be employed. For example, the class of tests may be reduced by considering only those with some additional desirable characteristics. A uniformly most powerful test may then exist within the reduced class.

2.11 Intuitive Substitute

When a uniformly most powerful test does not exist, an alternate intuitive strategy is to average the power of the test (that is, the probability of detection $P_d(\theta)$) with respect to the a priori probability density function governing those signal parameters $\theta_i$ whose statistics are unknown. A test is then sought that maximizes the average detection probability. This approach is related both to the modified Neyman-Pearson strategy previously discussed, in which $\langle P_d(\theta) \rangle_\theta$ is maximized for a fixed level $\alpha$, and to the minimax strategy, where an averaging is performed with respect to least favorable a priori statistics $\sigma_M(\theta)$. This is a conservative philosophy since, on the average, the value of $P_d$ obtained is the worst that can be expected.

For some radar parameters, solutions obtained with the intuitive substitute approach yield good results. In other cases, poor results are obtained. For example, consider the radar parameters: amplitude $A$, delay $\tau$, Doppler, and initial phase $\phi$. Statistical information is often available concerning signal amplitude; this is expressed by describing the target model as Rayleigh or one-dominant-plus-Rayleigh, the so-called Swerling models (4), (5). Averaging $P_d$ with respect to the appropriate amplitude probability density generally leads to a satisfactory result. On the other hand, the intuitive approach is generally unsatisfactory for both delay and Doppler. In particular, averaging over the regions of uncertainty of delay and Doppler leads to an unsatisfactory test in a multiple-target environment.

For starting phase $\phi$, the intuitive approach does provide satisfactory performance. A priori information concerning $\phi$ is usually unavailable; hence, a least favorable distribution (a uniform probability density function) is employed. Averaging over phase leads to an optimum receiver structure in which a matched filter is followed by an envelope detector. When compared to an optimum receiver for which $\phi$ is assumed to be known, it can be shown that the loss in detectability for $\phi$ with a uniform probability density is small (less than 1 dB) in the region of primary concern to the radar designer, namely, high signal-to-noise ratio.


2.12 Fixed and Sequential Testing

In the foregoing discussion, it has been tacitly assumed that a decision is made after a fixed observation interval in which data are collected. The observations made during this interval may consist, in general, of discrete or (sampled) continuous input waveforms. In some systems, the observation interval is not fixed but is of variable length and depends on the input data. This can be an advantage where it is desirable to keep the observation interval as short as possible. For example, when a large radar echo signal is received from a nearby target, it may be desirable to take advantage of this circumstance to shorten the observation interval.

A test procedure for a variable-length observation period has been developed by Wald and is known as a 'sequential test' (6). A similar concept was considered by Neyman and Pearson in 1933 as an extension of their theory of hypothesis testing. They defined three possible decisions: accept H, reject H, and no decision. In Wald's method, following each measurement it is decided whether to make a decision based upon the data already taken or to continue taking more data. Thus, the length of the observation interval depends on the quality of the available data. Although it is theoretically possible for a test to continue indefinitely, it has been demonstrated that on the average the observation interval is shorter in a sequential test than in a fixed test. Furthermore, in practice, a sequential test is usually truncated after some predetermined number of observations.
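A truncated sequential test of this kind can be sketched as follows. This is only an illustration under assumptions not stated in the report: independent Gaussian observations with known mean A under "signal present" and zero mean otherwise, and Wald's usual approximate stopping thresholds expressed through the desired error probabilities alpha and beta. The function name and the numerical values are hypothetical.

```python
import math


def sequential_test(samples, A, sigma, alpha, beta, max_n=None):
    """Wald-style sequential test on the accumulated log-likelihood ratio.

    Returns (decision, number of observations used). The decision is one of
    "signal present", "signal absent", or "no decision" (data exhausted or
    truncation limit max_n reached).
    """
    upper = math.log((1.0 - beta) / alpha)   # accept "signal present" above this
    lower = math.log(beta / (1.0 - alpha))   # accept "signal absent" below this
    log_lr = 0.0
    n = 0
    for n, v in enumerate(samples, start=1):
        # Log-likelihood ratio increment for one Gaussian observation.
        log_lr += (A * v - 0.5 * A * A) / sigma ** 2
        if log_lr >= upper:
            return "signal present", n
        if log_lr <= lower:
            return "signal absent", n
        if max_n is not None and n >= max_n:
            break
    return "no decision", n


# A strong echo terminates the test sooner than a weak one (illustrative data).
strong = sequential_test([10.0, 10.0, 10.0], A=10.0, sigma=1.0, alpha=0.1, beta=0.1)
weak = sequential_test([2.0, 2.0, 2.0], A=2.0, sigma=1.0, alpha=0.1, beta=0.1)
```

The three-way outcome mirrors the accept, reject, and no-decision options described above; in a real system the "no decision" branch at truncation would typically fall back to a fixed-sample rule.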


2.13 Concluding Remarks

The application of statistical decision theory to problems in communications and radar is being actively pursued and parallels similar efforts to apply the theory to other fields such as character, speech, and speaker recognition, weather prediction, medical diagnosis, and stock market prediction (3). Despite its power, certain limitations restrict its range of application. These limitations result from the requirements on the system model, which can never be completely satisfied in practice.

One limitation has to do with cost assignments. Assignments are usually made by the system designer and are therefore subject to individual bias. Fortunately, in many applications the structure of the optimum system is insensitive to variations in cost assignment. For example, the structure of a Bayes receiver for simple radar binary detection is independent of the magnitude of the preassigned costs. This is not true for complex cost assignments, however.

A more fundamental limitation stems from the need for a priori information concerning both the signal and noise processes. If such information is not available, the theory cannot be rigorously applied. If a priori information is available about the noise process but signal statistics are unavailable, a solution may be possible by invoking other criteria such as minimax.

In cases where more sophistication is required and less sensitivity to underlying distributions is desired ('robust' procedures), an adaptive system in which the system decision rule varies as 'learning' takes place is desirable.

The use of a decision tree, evaluating the features sequentially until a decision is made, requires considering the cost of measuring features as well as the cost of making errors. This is the subject of sequential decision theory (6). Reference (13) shows how techniques developed for searching game trees can be applied to such problems.




Figure 2. Example of parameter estimation as a decision problem

3.0 Parameter Estimation and Supervised Learning (7)

In section 2.6 it was shown that an optimal classifier for detection (a two-class problem) could be designed if the a priori probabilities, $P$ and $Q$, and the class-conditional densities $\langle p(v \mid s) \rangle_s$ and $p(v \mid 0)$ were known. Unfortunately, in pattern recognition applications we rarely have this 'luxury'. In a typical case we have some vague, general knowledge and a number of design samples, the classification of which is known.

One approach to designing the classifier is to use the samples (waveforms in which we know that a target (signal) is or is not present) to estimate the unknown probabilities and probability densities, and to use the resulting estimates as if they were true values. Usually, the estimation of the class-conditional densities is not feasible, since the number of available samples (waveforms) is almost always too small for the time available. If we can parameterize the conditional densities, the severity of the problem can be significantly reduced. Suppose, for example, that we can reasonably assume that $\langle p(v \mid s) \rangle_s$ comes from a distribution with mean $\mu_s$ and covariance matrix $\Sigma_s$, although we do not know the exact values of these quantities. The problem is then simplified to one of estimating $\mu_s$ and $\Sigma_s$, and not the probability densities themselves.

The problem of parameter estimation can be approached in several ways. Two of these procedures, outlined in Section 2, are 'maximum likelihood' estimation and 'Bayesian' estimation. Although the results obtained by these two procedures are often nearly identical, the approaches are conceptually quite different. Maximum likelihood methods view the parameters as quantities whose values are fixed but unknown. The best estimate is defined to be the one that maximizes the probability of obtaining the samples actually observed. Bayesian methods view the parameters as random variables having some known a priori distribution. Observation of the samples converts this to an a posteriori density, changing our opinion about the true parameter values.

In the Bayesian case, the typical effect of observing additional samples is to sharpen the a posteriori density function, causing it to peak near the true values of the parameters. This phenomenon is known as 'Bayesian learning'. We shall consider only this case here.

3.1 General Bayesian Learning

Let us assume that the class 'target present' is signified by the symbol '$\omega_2$', and target absent by '$\omega_1$'. Let $X$ denote a set of samples (e.g., waveforms representing a scan of the radar). We can emphasize the role of the samples by stating that our goal is to compute the a posteriori probabilities $P(\omega_i \mid v, X)$. From these probabilities we obtain the Bayes classifier:

(61)  $P(\omega_i \mid v, X) = \dfrac{p(v \mid \omega_i, X)\, P(\omega_i \mid X)}{\sum_{j=1}^{2} p(v \mid \omega_j, X)\, P(\omega_j \mid X)}$

Thus, we can use the information provided by the samples to help determine both the class-conditional densities and the a priori probabilities.

It will be assumed that the true values of the a priori probabilities are known, so that $P(\omega_i \mid X) = P(\omega_i)$. Thus, in our case $P(\omega_1) = Q$ and $P(\omega_2) = P$. Furthermore, in treating the 'supervised learning' case we can separate the samples by class into two subsets $X_1$ and $X_2$, with the samples in $X_i$ belonging to $\omega_i$. In the cases treated here, we assume that the samples in $X_i$ have no influence on $p(v \mid \omega_j, X)$ if $i \neq j$. This has two simplifying consequences. First, it allows us to work with each class separately, using only the samples in $X_i$ to determine $p(v \mid \omega_i, X)$. This allows us to write equation (61) as:

(62)  $P(\omega_i \mid v, X) = \dfrac{p(v \mid \omega_i, X_i)\, P(\omega_i)}{\sum_{j=1}^{2} p(v \mid \omega_j, X_j)\, P(\omega_j)}$

A second simplifying consequence is that each class can be treated independently, and we can dispense with needless class distinctions and simplify our notation. In essence, we have two separate problems of the following form: use a set $X$ of samples drawn independently according to the fixed but unknown probability law $p(v)$ to determine $p(v \mid X)$.

Although the desired probability density $p(v)$ is unknown, we assume that it has a known parametric form. The only thing assumed unknown is the value of the parameter vector $\theta$. The fact that $p(v)$ is unknown, but of known parametric form, will be expressed by saying that the function $p(v \mid \theta)$ is completely known. The Bayesian approach assumes that the unknown parameter vector is a random variable. Any information we might have about $\theta$ prior to observing the samples is assumed to be contained in a known a priori density $p(\theta)$. Observation of the samples converts this to an a posteriori density $p(\theta \mid X)$, which we hope will be sharply peaked about the true value of $\theta$.

Our basic goal is to compute $p(v \mid X)$, which is as close as we can come to obtaining the unknown $p(v)$. We do this by integrating the joint density $p(v, \theta \mid X)$ over $\theta$:

(63)  $p(v \mid X) = \int p(v, \theta \mid X)\, d\theta$,

where the integration extends over the entire parameter space. We can always write $p(v, \theta \mid X)$ as the product $p(v \mid \theta, X)\, p(\theta \mid X)$. Since the selection of $v$ and of the samples in $X$ is done independently, the first factor is merely $p(v \mid \theta)$. That is, the distribution of $v$ is known completely once we know the value of the parameter vector. Thus:

(64)  $p(v \mid X) = \int p(v \mid \theta)\, p(\theta \mid X)\, d\theta$.

Equation (64) links the desired density $p(v \mid X)$ to the a posteriori density $p(\theta \mid X)$ for the unknown parameter vector. If $p(\theta \mid X)$ peaks very sharply about some value $\hat{\theta}$, we obtain $p(v \mid X) \approx p(v \mid \hat{\theta})$, which is the result we would obtain by substituting the estimate $\hat{\theta}$ for the true parameter vector. If we are less certain about the exact value of $\theta$, equation (64) directs us to average $p(v \mid \theta)$ over the possible values of $\theta$. Thus, when the unknown densities have a known parametric form, the samples exert their influence on $p(v \mid X)$ through the a posteriori density $p(\theta \mid X)$.

The basic assumptions for Bayesian learning are then:

(1) The form of the density $p(v \mid \theta)$ is assumed to be known, but the value of the parameter vector $\theta$ is not known exactly.

(2) Our initial knowledge about $\theta$ is assumed to be contained in a known a priori density $p(\theta)$.

(3) The rest of our knowledge about $\theta$ is contained in a set $X$ of $n$ samples $v_1, v_2, \ldots, v_n$ drawn independently according to the unknown probability law $p(v)$.

The basic problem is to compute the a posteriori density $p(\theta \mid X)$, since from this we can use equation (64) to compute $p(v \mid X)$. By Bayes' rule,

(65)  $p(\theta \mid X) = \dfrac{p(X \mid \theta)\, p(\theta)}{\int p(X \mid \theta)\, p(\theta)\, d\theta}$,

and by the assumption that the samples are independent:

(66)  $p(X \mid \theta) = \prod_{k=1}^{n} p(v_k \mid \theta)$.

This constitutes the formal solution to the problem. Equations (64) and (65) illuminate its relation to the maximum likelihood solution. Suppose that $p(X \mid \theta)$ reaches a sharp peak at $\theta = \hat{\theta}$. If the a priori density $p(\theta)$ is not zero at $\theta = \hat{\theta}$ and does not change much in the surrounding neighborhood, then $p(\theta \mid X)$ also peaks at that point. Thus, equation (64) shows that $p(v \mid X)$ will be approximately $p(v \mid \hat{\theta})$, the result obtained by using the maximum likelihood estimate as the true value. If the peak of $p(X \mid \theta)$ is not so sharp that the influence of a priori information or the uncertainty in the true value of $\theta$ can be ignored, then the Bayesian solution tells us how to use the available information to compute the desired density $p(v \mid X)$.


To indicate explicitly the number of samples in a set, we write $X^n = \{v_1, v_2, \ldots, v_n\}$. Then, from equation (66), if $n > 1$,

(67)  $p(X^n \mid \theta) = p(v_n \mid \theta)\, p(X^{n-1} \mid \theta)$.

Substituting equation (67) into equation (65) and using Bayes' rule,

(68)  $p(\theta \mid X^n) = \dfrac{p(v_n \mid \theta)\, p(\theta \mid X^{n-1})}{\int p(v_n \mid \theta)\, p(\theta \mid X^{n-1})\, d\theta}$.

With the understanding that $p(\theta \mid X^0) = p(\theta)$, repeated use of this equation produces the sequence of densities $p(\theta)$, $p(\theta \mid v_1)$, $p(\theta \mid v_1, v_2)$, etc. This is called the 'recursive' Bayes approach to parameter estimation. When this sequence of densities converges to a Dirac delta function centered about the true parameter value, the resulting behavior is frequently called 'Bayesian learning'.

For most of the typically encountered probability densities $p(v \mid \theta)$, the sequence of a posteriori densities does converge to a delta function. This implies that with a large number of samples there is only one value for $\theta$ that causes $p(v \mid \theta)$ to fit the data, i.e., that $\theta$ can be determined uniquely from $p(v \mid \theta)$. When this is the case, $p(v \mid \theta)$ is said to be 'identifiable'.

There are occasions, however, when more than one value of $\theta$ may yield the same value for $p(v \mid \theta)$ (the 'multimodal' case). In such cases, $\theta$ cannot be determined uniquely from $p(v \mid \theta)$, and $p(\theta \mid X^n)$ will peak near all of the values of $\theta$ that explain the data. Fortunately, this ambiguity is erased by the integration in equation (64), since $p(v \mid \theta)$ is the same for all of these values of $\theta$. Thus, when supervised learning is considered, $p(v \mid X^n)$ will typically converge to $p(v)$ whether or not $p(v \mid \theta)$ is identifiable. When the classification of samples is not known a priori, as in 'unsupervised learning', identifiability is one of the major problems.
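The recursive update of equation (68) and the averaging of equation (64) can be sketched numerically. The example below is a minimal illustration under assumptions that are not in the report: the unknown parameter is the mean of a Gaussian with known standard deviation, the parameter space is discretized on a grid so that integrals become sums, and all function names and sample values are hypothetical.

```python
import math


def gaussian_pdf(v, mean, sigma):
    """Gaussian density p(v | theta = mean) with known standard deviation."""
    z = (v - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def recursive_bayes(prior, thetas, samples, sigma):
    """Apply the recursive update once per sample on a grid of theta values."""
    posterior = list(prior)
    for v in samples:
        unnormalized = [gaussian_pdf(v, th, sigma) * p
                        for th, p in zip(thetas, posterior)]
        total = sum(unnormalized)            # discrete stand-in for the integral
        posterior = [u / total for u in unnormalized]
    return posterior


def predictive_density(v, thetas, posterior, sigma):
    """Grid approximation of p(v | X): average p(v | theta) over the posterior."""
    return sum(gaussian_pdf(v, th, sigma) * p for th, p in zip(thetas, posterior))


thetas = [0.1 * i for i in range(-50, 51)]       # grid over the parameter space
prior = [1.0 / len(thetas)] * len(thetas)        # flat a priori weights
samples = [0.8, 1.3, 1.1, 0.7, 1.1]              # illustrative observations
posterior = recursive_bayes(prior, thetas, samples, sigma=1.0)
peak_theta = thetas[max(range(len(thetas)), key=posterior.__getitem__)]
```

Each pass through the loop uses the previous posterior as the new prior, so the grid weights sharpen around the values of theta that explain the data, which is the 'Bayesian learning' behavior described above.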

Appendix A includes a description of a Bayesian classifier program (20).

4.0 Unsupervised Learning and Clustering

In supervised learning, the membership of the training samples used to design the classifier is assumed known. In 'unsupervised' learning, the membership of the training samples is unknown a priori. This is precisely the type of problem characterized by the radar detection of a target in a background of noise and clutter. The reasons for this are as follows.


Firstly, the collection and labeling of a large set of sample patterns and their categorization can be costly and time consuming. If we could crudely design a classifier based upon a small set of samples whose classification is known, and then allow it to run without supervision on a large, unlabeled set, we might save a good deal of effort.

Secondly, in applications such as the radar detection of targets in ground clutter, the signal, as well as the background, can change slowly with time. An unsupervised mode classifier can track these changes and make timely corrections.

Additionally, in the early stages of an investigation such as this, it is necessary to gain some insight into the nature and structure of the data as applied to pattern recognition. The discovery of unanticipated subclasses may significantly alter the classifier design.


4.1 Mixture Densities and Identifiability

As a starting point, let us assume the following:

(1) The samples come from two classes.

(2) The a priori probabilities, $P$ and $Q$, are known.

(3) The forms for the class-conditional probability densities $p(v \mid \omega_j, \theta_j)$, $j = 1, 2$, are known.

(4) All that is unknown are $\theta_1$ and $\theta_2$.

The samples are assumed to be obtained by selecting a state of nature with the a priori probabilities $P$ and $Q$, so their probability density function is:

(69)  $p(v \mid \theta) = p(v \mid \omega_1, \theta_1)\, P(\omega_1) + p(v \mid \omega_2, \theta_2)\, P(\omega_2)$,

where $P(\omega_1)$ and $P(\omega_2)$ are the known a priori probabilities.

A density of this form is called a 'mixture density'. The conditional densities $p(v \mid \omega_j, \theta_j)$ are called the 'component densities', and $P$ and $Q$ are the 'mixing parameters'. The mixing parameters can be included among the unknown parameters, but we shall assume that only the $\theta_j$'s are unknown.

As discussed in section 3.1, a density $p(v \mid \theta)$ is said to be 'identifiable' if $\theta \neq \theta'$ implies that there exists a $v$ such that $p(v \mid \theta) \neq p(v \mid \theta')$.

Most mixtures of commonly encountered density functions are identifiable. Discrete distribution mixtures are often not identifiable. We will assume further that the mixture densities are identifiable.
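The remark about discrete mixtures can be illustrated with a small example that is not taken from the report: a mixture of two Bernoulli components with equal mixing parameters of one half. The function name and the parameter values are hypothetical. The mixture probability of observing a 1 depends only on the sum of the two component parameters, so different parameter vectors can produce exactly the same mixture and cannot be told apart from unlabeled samples.

```python
import math


def bernoulli_mixture(x, theta1, theta2, mix1=0.5):
    """Mixture probability of x (0 or 1) for two Bernoulli components."""
    comp1 = theta1 if x == 1 else 1.0 - theta1
    comp2 = theta2 if x == 1 else 1.0 - theta2
    return mix1 * comp1 + (1.0 - mix1) * comp2


# Two different parameter vectors with the same sum theta1 + theta2.
params_a = (0.2, 0.8)
params_b = (0.4, 0.6)
same_mixture = all(
    math.isclose(bernoulli_mixture(x, *params_a), bernoulli_mixture(x, *params_b))
    for x in (0, 1)
)
```

Here `same_mixture` evaluates to true: no value of x distinguishes the two parameter vectors, so this mixture is not identifiable in the sense defined above.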

,  0  „  ,  .  (15) 

4.  2  Clustering 

Although  nothing  is  assumed  to  be  known  about  the  category 
structure,  one  frequently  has  some  intuitive  feelings  about  desirable 
and  undesirable  features  for  a  classification  scheme.  One  might  ask, 
"Why  not  enumerate  all  the  possibilities  and  choose  the  best"? 


The number of ways of sorting n observations into m groups is
given by:

$S_n^{(m)} = \frac{1}{m!} \sum_{k=0}^{m} (-1)^{m-k} \binom{m}{k} k^n$

Even for the detection problem, where m = 2, and a number of
observations n = 25, the number of combinations is 16,777,215. For
n = 25, m = 3, the number grows to 141,197,991,025.


If the number of groups (classes) is unknown, the number of possibilities
rises to more than $4 \times 10^{18}$. This makes an exhaustive examination of
the alternatives impractical.
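The counts quoted above can be checked directly. The sketch below evaluates the sum over k for the number of ways of sorting n observations into m non-empty groups, and adds the counts over all m for the case where the number of groups is unknown. The function names are illustrative only.

```python
from math import comb, factorial

def partitions_into_groups(n, m):
    """Number of ways to sort n observations into m non-empty groups."""
    total = sum((-1) ** (m - k) * comb(m, k) * k ** n for k in range(m + 1))
    return total // factorial(m)

def partitions_any_number_of_groups(n):
    """Total over m = 1..n, for the case where the number of groups is unknown."""
    return sum(partitions_into_groups(n, m) for m in range(1, n + 1))

print(partitions_into_groups(25, 2))   # 16777215
print(partitions_into_groups(25, 3))   # 141197991025
print(partitions_any_number_of_groups(25) > 4 * 10**18)   # True
```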

Cluster  algorithms  are  used  to  generate  hypotheses  about  cate¬ 
gory  structure.  A  role  often  exploited  is  that  of  discovering  'natural 
classes'.  If  a  suitable  algorithm  is  applied  to  a  set  of  data,  and  the 
resulting  clusters  are  only  weakly  differentiated,  then  the  data  probably 
belong  to  only  one  class.  Thus,  the  user  of  a  clustering  algorithm  is 
often trying to understand the data set and uncover what structure resides
in the data.

4.2.1  Clustering Methodology (16)

As discussed in Reference 16, clustering techniques can be considered
in three groups: minimization of squared error, hierarchical, and
graph-theoretic. Each of these groups will be briefly discussed.

.1  Squared-Error Clustering Algorithms

Squared-error algorithms try to define clusters which are
hyperellipsoidal in shape. Let the ith pattern, i = 1, ..., n,
from the data set under study be written as:

(70)  $x_i = (x_{i1}, x_{i2}, \ldots, x_{iN})^T$, where $(\,\cdot\,)^T$

indicates the vector transpose (a column vector in this case). The
number of patterns, n, is assumed to be much greater than the
number of features, N. A clustering is a partition $[C_1, C_2, \ldots, C_K]$
of the integers [1, 2, ..., n] that assigns each pattern a single
cluster label. The patterns corresponding to the integers in $C_k$
form the kth cluster, whose center is:

(71)  $c_k = (c_{k1}, c_{k2}, \ldots, c_{kN})^T$, where

(72)  $c_{kj} = \frac{1}{M_k} \sum_{i \in C_k} x_{ij}$, and $M_k$ is the cardinality of $C_k$ (the
number of patterns in cluster k). Thus, a cluster center is
the centroid, or sample mean, of all patterns in the cluster.

The squared error for cluster k is:

(73)  $e_k^2 = \sum_{i \in C_k} (x_i - c_k)^T (x_i - c_k)$, and the squared error for the
clustering is:

(74)  $E_K^2 = \sum_{k=1}^{K} e_k^2$.

The squared error of eq. (74) can be expressed in many ways,
such as the sum of "within" and "between" squared errors used in
discriminant analysis (7). The objectives are to define, for a given K,
a clustering that minimizes $E_K^2$, and to find a suitable K, much smaller
than n. Since an exhaustive search is computationally infeasible, the
various squared-error programs implement different tactics for searching
through the possible clusterings. All programs try to find a local
minimum of $E_K^2$. The user hopes that this local minimum also coincides
with the global minimum.
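The quantities in eqs. (71) through (74) can be sketched as follows. The data values and function names are hypothetical, and a clustering is represented as a list of index lists, which is only one possible encoding of the partition.

```python
def centroid(patterns):
    """Eqs. (71)-(72): sample mean of the patterns in one cluster."""
    M_k = len(patterns)
    return [sum(column) / M_k for column in zip(*patterns)]

def cluster_squared_error(patterns):
    """Eq. (73): sum of squared distances from the cluster center."""
    c = centroid(patterns)
    return sum(sum((xj - cj) ** 2 for xj, cj in zip(x, c)) for x in patterns)

def clustering_squared_error(data, clustering):
    """Eq. (74): sum of the per-cluster squared errors."""
    return sum(cluster_squared_error([data[i] for i in C_k]) for C_k in clustering)

# Four hypothetical two-feature patterns forming two compact groups.
data = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]]
natural = [[0, 1], [2, 3]]    # groups the nearby patterns together
poor = [[0, 2], [1, 3]]       # pairs patterns across the gap

print(clustering_squared_error(data, natural))  # 4.0
print(clustering_squared_error(data, poor))     # 100.0
```

For the same K, the partition that follows the natural grouping gives the smaller squared error, which is the criterion the squared-error programs search on.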

An example of such a methodology is Forgy's method, for which a
simplified flow chart is shown in Fig. 4.1. The heart of the method is
the inner loop in Fig. 4.1, which establishes the way in which clusters
are updated. Given a set of cluster centers, the cluster label of the
closest cluster center is assigned to each pattern. The cluster centers
are then recomputed as sample means, or centroids, of all patterns
having the same cluster label.
A new cluster is created in the inner loop when a pattern is found
that is sufficiently removed from the existing structure. Let $d_k(i)$ be
the distance between pattern i and cluster center k. Let $\bar{d}(i)$ be the
average distance from pattern i to all K cluster centers:

(75)  $\bar{d}(i) = \frac{1}{K} \sum_{k=1}^{K} d_k(i)$.

A new cluster is created, centered at pattern i, if:

(76)  $|d_{k_0}(i) - \bar{d}(i)| \le T_F\,|\bar{d}(i)|$, where $k_0$ is the cluster center
closest to pattern i and $T_F$ is a user-supplied threshold between zero
and one. The larger $T_F$, the more new clusters that will be created.

The inner loop is repeated until either two successive passes through
all patterns produce the same clustering or a user-supplied limit, $L_F$,
on the number of loops has been exceeded. The number of patterns,
$M_k$, in cluster k is then computed for each k and compared to the
user-supplied number $N_F$. If $M_k < N_F$, all patterns in cluster k are removed
and henceforth ignored. Thus, such patterns are considered to be
'outliers'. This is the only means available in FORGY for reducing the
number of clusters.
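One pass of the inner loop and the outlier step described above might be sketched as follows. This is not the FORGY program of Appendix B. The Euclidean distance, the function names, the guard against a single existing center, and the example data are all assumptions made for illustration.

```python
import math

def distance(a, b):
    """Euclidean distance (an ASSUMED choice of metric)."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def forgy_pass(data, centers, T_F):
    """Label every pattern once, creating new clusters by the test of eq. (76)."""
    centers = [list(c) for c in centers]
    labels = []
    for x in data:
        d = [distance(x, c) for c in centers]
        k0 = min(range(len(d)), key=d.__getitem__)   # closest cluster center
        d_bar = sum(d) / len(d)                       # eq. (75)
        # Eq. (76). The len(centers) > 1 guard is an added assumption:
        # with a single center the test is always satisfied.
        if len(centers) > 1 and abs(d[k0] - d_bar) <= T_F * abs(d_bar):
            centers.append(list(x))
            k0 = len(centers) - 1
        labels.append(k0)
    # Recompute each center as the centroid of the patterns carrying its label.
    new_centers = []
    for k, old in enumerate(centers):
        members = [x for x, lab in zip(data, labels) if lab == k]
        if members:
            new_centers.append([sum(col) / len(members) for col in zip(*members)])
        else:
            new_centers.append(old)
    return labels, new_centers

def patterns_kept(labels, N_F):
    """Indices of patterns that survive removal of clusters with M_k < N_F."""
    counts = {}
    for lab in labels:
        counts[lab] = counts.get(lab, 0) + 1
    return [i for i, lab in enumerate(labels) if counts[lab] >= N_F]

data = [[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0], [5.0, 20.0]]
labels, centers = forgy_pass(data, [[0.0, 0.0], [10.0, 0.0]], T_F=0.1)
print(labels)                    # [0, 0, 1, 1, 2]
print(patterns_kept(labels, 2))  # [0, 1, 2, 3]
```

In this example the last pattern is equally far from both centers, so it satisfies eq. (76) and starts a new cluster, which is then discarded as an outlier because it holds fewer than N_F = 2 patterns.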

FORGY was constructed to be as direct as possible. The initialization
procedure follows this philosophy and fixes the initial number of
cluster centers by selecting $K_F$ patterns, where $K_F$ is supplied by the
user.

A listing of the program that analyzes data stored in core memory
by Forgy's and Jancey's methods, written for a CDC 6500-type
computer, is given in Appendix B.
.2  Hierarchical Clustering

The hierarchical clustering techniques produce a 'dendrogram'
which describes the clustering of the patterns. The dendrogram connects
groups of patterns at levels of similarity. It may be used to
group the patterns into a given number of clusters as well as to indicate
how many clusters there are at a given similarity level. Similarity
is often defined from the interpattern distances. Such a program is
described in Appendix C.

These  techniques  begin  with  a  triangular  dissimilarity  matrix, 
whose  rows  and  columns  correspond  to  patterns,  and  whose  entries 
measure  dissimilarity  between  patterns;  the  larger  the  entry,  the 
more  dissimilar  the  patterns. 

The  number  of  patterns  that  can  be  handled  by  such  methods  is 
limited  since  such  techniques  are  very  expensive  in  computer  time  and 
memory. 
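A minimal sketch of how a dendrogram can be built from a dissimilarity matrix is given below. The single-linkage merging rule is an assumption (the text does not say which rule the program of Appendix C uses), and the function name and data are hypothetical. The repeated scan over all pairs of clusters also shows why such methods become expensive as the number of patterns grows.

```python
def single_linkage(D):
    """Merge clusters pairwise by smallest dissimilarity (ASSUMED single linkage).

    D is a full symmetric dissimilarity matrix. The returned list of
    (level, members_a, members_b) tuples is the dendrogram in merge order.
    """
    clusters = {i: [i] for i in range(len(D))}
    merges = []
    while len(clusters) > 1:
        best = None
        keys = sorted(clusters)
        for pos, a in enumerate(keys):
            for b in keys[pos + 1:]:
                level = min(D[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or level < best[0]:
                    best = (level, a, b)
        level, a, b = best
        merges.append((level, sorted(clusters[a]), sorted(clusters[b])))
        clusters[a] = clusters[a] + clusters.pop(b)
    return merges

# Four hypothetical one-feature patterns; dissimilarity is absolute difference.
points = [0.0, 1.0, 5.0, 6.0]
D = [[abs(a - b) for b in points] for a in points]
for merge in single_linkage(D):
    print(merge)
```

Cutting the dendrogram just below the last merge level (4.0 here) yields two clusters, {0, 1} and {2, 3}.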

.3  Graph-Theoretic Methods

Not all natural groupings of patterns are globular or hyperellipsoidal
in shape. For example, patterns that are spaced along a
straight line or in a plane in the pattern space are well structured.
Squared-error methods force a globular or Gaussian-based model
on such structures and therefore fail. Graph-theoretic methods provide
one means for uncovering unconventional data structures.

One example is the technique of Zahn (17) to produce a minimal
spanning tree. A description of the program is included in Appendix
D. The routine generates a minimal spanning tree, and then evaluates
the tree for self-consistent clusters of patterns. The algorithm used
is that of Prim and Dijkstra, as implemented by Whitney.
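The idea can be sketched as follows: build a minimal spanning tree with Prim's algorithm, then delete long edges and read off the connected groups. This is not the program of Appendix D. In particular, the fixed edge-length threshold used for pruning is a simplifying assumption standing in for Zahn's test of edge consistency, and all names and data are hypothetical.

```python
import math

def prim_mst(points):
    """Prim's algorithm on the complete Euclidean graph; returns (length, a, b) edges."""
    n = len(points)
    in_tree = [False] * n
    best = [math.inf] * n      # cheapest known connection of each point to the tree
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=best.__getitem__)
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((best[u], parent[u], u))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v], parent[v] = d, u
    return edges

def clusters_from_mst(n, edges, max_edge):
    """Drop edges longer than max_edge (ASSUMED pruning rule); return the groups."""
    root = list(range(n))

    def find(i):
        while root[i] != i:
            root[i] = root[root[i]]
            i = root[i]
        return i

    for length, a, b in edges:
        if length <= max_edge:
            root[find(a)] = find(b)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Hypothetical patterns spaced along a straight line, with one large gap.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0),
          (10.0, 0.0), (11.0, 0.0), (12.0, 0.0)]
edges = prim_mst(points)
print(clusters_from_mst(len(points), edges, max_edge=2.0))
```

The two elongated groups are recovered even though neither is globular, which is the case where squared-error methods have difficulty.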
FIGURE 4.1 - FORGY/JANCEY METHOD

5.0  Testing Methods

Given a set of data and various algorithms to analyze the data, the
question to answer is, "Which algorithm performs best with the given
data?"

If the underlying phenomenology of the data is either unknown or
varies widely under the influence of various factors, a parametric
statistical technique may become impractical. Considering what is known
about the statistics of radar ground clutter, these remarks seem to apply.

The  term  "best"  in  the  context  of  this  problem  will  be  defined  to 
mean  the  algorithm  which  yields  the  lowest  average  probability  of  error 
when  operating  on  a  known  data  set.  This  definition  may  not  be  entirely 
adequate  when  faced  with  implementing  the  algorithm  to  give  real-time 
radar operation. However, the overriding concern at the initial evaluation
phase is to determine which of the various algorithms proposed are
most  compatible  with  the  general  data  structure  presented  by  radar 
ground  clutter. 

Even  at  this  early  evaluation  phase,  certain  constraints  on  economy 
of  cost  and  computer  running  time  force  the  consideration  of  a  limited 
data  set  for  the  evaluation. 

Because the data set must be limited for economy, only a small number
of samples is available, and the resulting classifier may not perform
well on new data. The error rate is therefore expected to be a function
of the number of samples, typically decreasing to some minimum value
as the number of samples becomes much larger.
One approach to estimating the error rate is to compute it from an
empirically derived parametric model. There are many pitfalls to this
approach, not the least of which is the uncertainty of the underlying
probability distributions.

An empirical approach is to test the classifier experimentally. In
practice, this is frequently done by running the classifier on a set of
test samples and using the fraction of the samples misclassified as an
estimate of the error probability. Obviously the test samples should be
different from the design samples, or the results will be highly optimistic.
If the true but unknown error rate of the classifier is p, and if k of the
n independent, randomly drawn test samples are misclassified, then k
has the binomial distribution:

(77)  $P(k) = \binom{n}{k} p^k (1 - p)^{n-k}$.

Thus, the fraction of test samples misclassified is the maximum likelihood
estimate for p:

(78)  $\hat{p} = k/n$.

The properties of $\hat{p}$ for a binomial distribution are well known. Figure
5.1 shows the 95% confidence intervals as a function of $\hat{p}$ and n (7). For a
given value of $\hat{p}$, the probability is 0.95 that the true value of p lies
between the upper and lower curves for the number n of test samples.
These curves show that unless n is large, this maximum likelihood
estimate should be carefully interpreted. For example, if no errors are
made on 50 test samples, the true error probability lies between 0 and
8% with probability 0.95. The classifier would have to make no errors
on more than 250 samples to be reasonably sure that the true error rate
is below 2%.
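The zero-error case quoted above can be checked numerically. The sketch below uses the exact two-sided (Clopper-Pearson) upper limit for k = 0, which is one standard construction and is assumed here; the curves of Figure 5.1 may have been computed differently, so only rough agreement with values read from the figure should be expected. Function names are illustrative.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Eq. (77): probability that k of n test samples are misclassified."""
    return comb(n, k) * p ** k * (1.0 - p) ** (n - k)

def zero_error_upper_limit(n, confidence=0.95):
    """Upper confidence limit on p when 0 of n test samples are misclassified.

    Solves (1 - p)^n = (1 - confidence) / 2 for p (ASSUMED two-sided interval).
    """
    tail = (1.0 - confidence) / 2.0
    return 1.0 - tail ** (1.0 / n)

for n in (50, 250):
    print(n, round(zero_error_upper_limit(n), 4))
```

Under this assumption the limit is about 7% for n = 50, in rough agreement with the 8% read from the curves, and falls below 2% for n = 250.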

The  need  for  data  to  design  the  classifier  and  additional  data  to 
evaluate  it  presents  a  dilemma  when  the  number  of  samples  has  been 
limited.  If  most  of  the  data  is  reserved  for  design,  the  test  will  not  be 
reliable.  If  most  of  the  data  is  reserved  for  test,  the  design  will  be  poor. 
The question of how best to partition a set of samples into a design set
and a test set cannot be answered definitively.

The technique that comes closest to the true error probability is
the "leaving-one-out" method. This involves training (designing) the
classifier using n-1 samples, and testing it on the remaining sample.
The classifier is run n times, leaving a different sample out on each
run. Thus, almost all of the samples are used in the design,
which should lead to a good design. Also, all of the samples are used
for test. The problem with this technique is that it is only practical
when n is quite small.

A practical compromise is the $\pi$ or rotation method (22). In this
technique, a small subset of P pattern samples is chosen, where $1 \le P \ll n$,
n/P is an integer, and P/n < 1/2. The classifier is trained on the n-P
remaining samples, and tested on the P samples. An estimate of the
error probability $P_e[\pi]_i$ is obtained for the ith run. The runs are made
n/P times, using a different set of P samples each time for test, and
training on the remaining samples. The resulting approximate estimate
of $P_e$ is calculated as:

(79)  $E\{P_e[\pi]\} = \frac{P}{n} \sum_{i=1}^{n/P} P_e[\pi]_i$, where

$E\{P_e[\pi]\}$ is the expected value of $P_e[\pi]_i$. Note that when P = 1, the method
reduces to the leaving-one-out method. Reference (21) also suggests that
a better estimate of the true error probability might be obtained by:

(80)  $P_e^* = \frac{1}{2}\left[E\{P_e[\pi]\} + E\{P_e[R]\}\right]$, where

$E\{P_e[R]\}$ = the estimated error probability
based upon training on all of the samples and testing on all of the samples,
giving, as previously discussed, a highly optimistic error probability, but
a reasonable lower bound to the true error probability.
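The rotation procedure and the averaged estimate of eq. (80) can be sketched as below. The function names, the use of consecutive blocks of P samples as test sets (which assumes the samples are already in random order), the 1-nearest-neighbor stand-in classifier, and the data are all assumptions for illustration.

```python
def rotation_estimate(samples, labels, P, train, classify):
    """Average test error over n/P runs, each holding out a different block of P."""
    n = len(samples)
    if n % P != 0:
        raise ValueError("n/P must be an integer")
    run_errors = []
    for start in range(0, n, P):
        stop = start + P
        model = train(samples[:start] + samples[stop:],
                      labels[:start] + labels[stop:])
        wrong = sum(classify(model, samples[i]) != labels[i]
                    for i in range(start, stop))
        run_errors.append(wrong / P)           # error estimate for this run
    return (P / n) * sum(run_errors)           # mean over the n/P runs

def resubstitution_estimate(samples, labels, train, classify):
    """Train on all samples and test on the same samples (optimistic)."""
    model = train(samples, labels)
    wrong = sum(classify(model, x) != y for x, y in zip(samples, labels))
    return wrong / len(samples)

def averaged_estimate(e_rotation, e_resubstitution):
    """Eq. (80): mean of the rotation and resubstitution estimates."""
    return 0.5 * (e_rotation + e_resubstitution)

# Stand-in classifier: 1-nearest-neighbor (ASSUMED for the example only).
def train_1nn(xs, ys):
    return list(zip(xs, ys))

def classify_1nn(model, x):
    return min(model, key=lambda m: sum((a - b) ** 2 for a, b in zip(m[0], x)))[1]

samples = [[0.0], [1.0], [2.4], [2.6], [4.0], [5.0]]
labels = [0, 0, 0, 1, 1, 1]
e_rot = rotation_estimate(samples, labels, 1, train_1nn, classify_1nn)  # P = 1
e_res = resubstitution_estimate(samples, labels, train_1nn, classify_1nn)
print(e_rot, e_res, averaged_estimate(e_rot, e_res))
```

With P = 1 this is the leaving-one-out method: the two samples nearest the class boundary are each misclassified, while resubstitution reports no errors at all, which illustrates why the latter is only a lower bound.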


Figure 5.1.  Confidence Intervals for Error-Rate Estimates
6.0  Tests with Simulated Radar Data


Preliminary evaluation of the various algorithms was conducted
using simulated radar data supplied by W. L. Simkins, Jr., of RADC.
The data consisted of 65,536 (256 x 256) samples in x-y presentation.
Ground clutter measurements for each x-y coordinate in amplitude and
doppler were included. Details of the data are described in Appendix
E.


Test runs were made to determine the amount of processing time
necessary for each of the algorithms to be evaluated. Figure 6.0.1,
taken from Appendix E, shows the processing times as a function of
the number of samples for two typical algorithms: the Bayes classifier
and the kNN algorithm (3).

Even allowing for the fact that the kNN algorithm is evaluated for
k = 1, 3, 4, 5, 6, 7, 8, 9, and 10 in parallel, and therefore takes longer
than a run with only k = 1, Figure 6.0.1 shows that the processing
times are too great to allow all 65,536 samples to be used. Table 6.0.1
summarizes the processing times for 257 samples for each of
the algorithms.


Processing Times

Algorithm                  n = 257        n = 1024
                           Time (min)     Time (min)

Bayes                      0.877          2.8
NN                         2.63           38.3
Hierarchical Clustering    83.497
Minimal Spanning Tree      2.718

Table 6.0.1
Based upon these conclusions, runs were made to evaluate the
various algorithms using a sample set including up to 1,024 samples.

The limited field of view included the region from x = 1 through 128,
and y = 128 through 135 as described in Appendix E. The selection was
chosen to give a wide variety of amplitude and doppler values.

Fifteen runs were made for each of 5 sample sizes (60, 135, 255, 510,
and 1005) for the Bayes and kNN algorithms. The $\pi$ (rotation) method
was used to evaluate the error probabilities, as discussed in Section
5.0. Typical results for these algorithms are shown in Tables 6.0.2 and
6.0.3. In these tables, the first column indicates the number in the
sample set. The second column indicates the error probability obtained
by "testing on the training set." The third column is the expected value
of the error probability as obtained by the $\pi$ method. The fourth column
is the result of applying equation (80) to columns 2 and 3. As discussed
by Toussaint and Sharpe (21), this yields a closer estimate of the true
average error probability. The fifth column in Table 6.0.3 is 1/2 of the
fourth column, and represents an estimate of the true Bayesian error
probability.
$P_{total}[X_k \mid x_i] = \sum_{j=1}^{NVAR} P_j[X_{j,k} \mid x_{i,j}]^{\alpha}$

Bayes Classifier - α = 0.5

x, y, Amplitude, Doppler - Doppler Categories

Number of
Samples       Pe[R]       E{Pe[π]}     Pe

60            0.0500      0.6167       0.3334
135           0.0296      0.5111       0.2704
255           0.0353      0.4157       0.2255
510           0.1294      0.5725       0.3510
1005          0.4657      0.3771       0.4214

Table 6.0.2.  Error Probabilities


1NN Algorithm

x, y, Amplitude, Doppler - Doppler Categories

Number of
Samples       Pe[R]       E{Pe[π]}     Pe          Pe*

60            0.1833      0.2167       0.2000      0.1000
135           0.0519      0.1037       0.0778      0.0389
255           0.0784      0.0980       0.0882      0.0441
510           0.0078      0.0098       0.0088      0.0044
1005          0.0010      0.0050       0.0030      0.0015

Table 6.0.3.  Error Probabilities



7.0  Tests With Actual Radar Data

W. L. Simkins, Jr. also has supplied a test tape containing samples
of actual radar data. The data consists of a number of files containing
P, $ , and Amplitude information. Some runs were made, using
the Bayes and kNN algorithms. The results so far obtained were
comparable to those obtained with the simulated radar data discussed in
Section 6 and Appendix E of this report. However, the number of runs
made was insufficient to give definitive information about the data and
how each of the algorithms performed.
8.0  Summary, Conclusions and Recommendations

Preliminary runs testing simulated radar data with a number of
conventional pattern recognition algorithms indicated that the
1-nearest-neighbor nonparametric algorithm showed promise in producing low
error probabilities. As discussed in Appendix E, Figure 6.0.2(b), the
errors made were primarily at the transitions between one class and
another. This indicates that combining the nearest-neighbor algorithm
with a gradient technique to sense the "edges," or boundaries between
classes, might produce fewer errors.

The times required for the nearest-neighbor algorithm to process
only up to 1,024 samples are much too large to yield practical real-time
processors in a radar. However, there are techniques which can
significantly reduce these times (22).

Although runs were made with the Minimal Spanning Tree Algorithm,
the results obtained were indifferent at best. One reason for this is that
although the technique to produce the minimal spanning tree is an efficient
one, the "pruning" of the tree, as the algorithm is presently constituted,
does not allow prior selection of the number of clusters. Thus, the
resulting clusters, not being under the control of the program, tend to be
different from the natural grouping of the data.

In the light of these preliminary findings, the following
recommendations are made for follow-on activity.

- Additional tests using actual radar data comparable in
format to the simulated data should be made of at least
the nearest-neighbor algorithm.

- Gradient or edge-detection methods should be investigated
and incorporated into whatever algorithm is employed.

- Investigation and incorporation of techniques to allow
processing of at least two orders of magnitude more data in
real time should be pursued, particularly for the
nearest-neighbor algorithm.

- The minimal spanning tree algorithm should be modified
to allow the selection of the number of clusters required
of the data and tested with simulated and actual radar data.

- A fuzzy k-means algorithm should be incorporated into
the evaluation process. This technique offers some promise
in the type of problem presented by radar ground
clutter.
y>

6  k 
ko(i) 


p(d, v  s) 


p(v  |s) 


P(v  |s)c 


C(s,  d) 


ci-« 

C 

a 


Glossary 


Cluster  center  of  kth  cluster 

Average  distance  from  pattern  i  to  all  cluster  centers 
Distance  between  pattern  i  and  cluster  center  k 
Squared  error  for  cluster  k 
Cluster  center  closest  to  pattern  i 
Generalized  likelihood  ratio 

Binomial  coefficient  indicating  the  combination  of  m 
things  taken  k  at  a  time 

Null  vector 

Joint conditional probability density function that decision
d and waveform v will occur given that signal s has
occurred

A  priori  joint  probability  density  function  over  all  noise 
signals  n  in  noise  space 

Conditional  probability  density  function  that  a  particular 
waveform  v  will  occur,  given  that  a  signal  s  has  occurred 

Expectation of p(v|s) over all signals s

All  possible  joint  combinations  of  signal  and  noise  wave¬ 
forms  within  the  observation  interval  in  observation  space 

Average  amplitude  of  waveform 

Quantitative cost associated with each point s in Ω (signal
space) and each point d in Δ (decision space)

Cost  associated  with  correctly  deciding  a  signal  is  present 

Cost  associated  with  correctly  deciding  a  signal  is  absent 

Penalty  associated  with  deciding  that  signal  is  present, 
when  there  is  no  signal 

Cost associated with deciding no signal, when there is a
signal
r


B 

D(d  |  v) 


D 

D 


M 

N? 


H  , 


H. 


LBfo) 

Lc(D|s) 

L(D,  a  ) 


“k 

nf 
p,  Q 


X. 

1 

x11 


Bayes  decision  rule 

Decision  rule  leading  to  a  decision  d,  having  observed 
a  waveform  v 

Minimax  decision  rule 

Neyman- Pearson  decision  rule 

Squared  error  for  a  clustering 

Null  hypothesis  (i.e. ,  that  noise  alone  is  present) 

Composite alternate hypothesis (i.e., that signal plus noise is
present)

Initial  choice  of  number  of  clusters 

Bayes average loss for an a priori distribution σ(s)

The mathematical expectation of the loss with respect to
all possible decisions d for a given s and decision rule D

Average loss for a known a priori probability density
σ(s) and decision rule D. The sum of costs associated
with decisions d and inputs s weighted according to their
joint probability of occurrence

Number  of  patterns  in  cluster  k 

Number  (user  supplied)  to  eliminate  outliers 

A  priori  probabilities  of  signal  present  and  signal  absent, 
respectively 

Decision  threshold 

Threshold  (user  supplied)  for  creation  of  new  cluster 
(number  between  0  and  1) 

Set  of  samples  from  class  i 

Set  of  n  samples 


60 


a 

7T 

p 

6  (S-O) 


(s) 


r 

A 

s  s 

>T 


False  alarm  probability 

Average  missed-detection  probability 

Discrete probability distribution of s over space Ω
(signal absent region)

Probability of rejecting H_0 when it is true (level of test)

Parameter  vectors  determining  signal  s 

Estimate  of  parameter  vector 

Lagrange  multiplier 

Mean vector of a multivariate probability distribution

Joint a priori probability density function over all the
points s in signal space

Variance  of  given  waveform 

Time  delay 

Starting  phase 

Pattern  class  i 

Probability density of s over space Ω_1 (signal plus noise
region)

Observation  space 
Decision  space 

Covariance  matrix  of  a  multivariate  probability 
distribution 

Vector  transpose 
APPENDIX A - BAYES CLASSIFIER PROGRAM


BAYES 

This routine performs an approximate multivariate Bayes rule
classification. It also produces the frequency histograms for each feature over
each category and over all categories. Since the "true" probability
distributions for each feature are presumed to be unknown (if you know them,
you may be better off running SPSS and/or BMD), the frequency histograms
are used in place of the probability distributions in the Bayes classification.
We expect that considerable development of this routine will be required
before it is suitable for any but very large data bases.


WORD     DEFAULT    DESCRIPTION

NIN      (IORIG)    Input unit.

NPNT     0          LE 0   No action.
                    GT 0   Histograms produced on line-printer.

NPRO     0          LE 0   The a priori probability that a given pattern
                           is a member of a given category is 1.0 for
                           all categories.
                    GT 0   The a priori probability that a given pattern
                           is a member of a given category is (number
                           of patterns in that category)/NPAT.

NRES     0          LE 0   The resolution of the histograms is 1/5 of the
                           number of patterns in the smallest category,
                           rounded to the nearest integer multiple of 10.
                    GT 0   The resolution of the histograms is NRES.
                           (The maximum allowed resolution is NPAT.)

LOSS     0          LE 0   The misclassification risk for each category
                           is 1.0.
                    GT 0   The misclassification risk for each category
                           is user defined. See (1) below.

NPROB    0          LT 0   No classification; only histograms produced.
                    EQ 0   Default prediction done; individual feature
                           probabilities summed with α = 0.5, 1.0, 2.0
                           and Σ ln(prob).
                    GT 0   Prediction summation rules defined by user.
                           See (2) below.

(1)  Misclassification risk: Specify the risk associated with each category in
the following format:

i, r$    where i = index of the category and r = misclassification risk
associated with the ith category.

End risk input with i = 0. All categories not explicitly defined have a
misclassification risk of 1.0.

(2)  Probability summation rules: Specify summation α's in the following
format:

i, α$    where i = dummy index and α = desired summation α
(α = 0 specifies Σ ln(prob)).

End α input with i = 0.

The  following  example  illustrates  this  option: 

To  combine  the  individual  feature  probabilities  using: 

Σ(prob)^0.1, Σ(prob)^0.5, Σ(prob)^2, Σ(prob)^10, Σ(prob)^1, Σ ln(prob) ...

1,0.1$

1,0.5$

1,2$

1,10$

1,1$

1,0$

0$

Prerequisites:  Category-type  data. 

Hints and cautions: Works best on orthogonal features, reduced down to
"meaningful" minimum number. See KARLOV and SELECT. Histograms
may be obtained on continuous property data.

References: Any numerical statistics text for Bayes Classification Rule.


SECTION  II:  DEFINITIONS 

1.  RESOLUTION UNIT

NRES = the number of equal intervals the features will be divided
into
     = user defined
       or
       N_min/5 (rounded up to nearest even factor of 10)

N_min = the number of patterns in the smallest category


2.  MINIMUM

MIN_i = the smallest x_i value in the training set


3.  MAXIMUM

MAX_i = the largest x_i value in the training set


4.  INCREMENT

INC_i = (MAX_i - MIN_i)/NRES


5.  PROBABILITY

PROB_k = the a priori probability of a given pattern being
a member of category k. (Will either equal 1.0 or
(N_k/NPAT), where N_k = number of patterns in
category k.)


6.  RISK

RISK_k = the risk associated with misclassifying a pattern which is
in category k. (Will be 1.0 if not otherwise defined by the
user.)
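Definitions 1 through 4 amount to dividing each feature's range into NRES equal intervals. A minimal sketch of that binning is given below. It is not the CLPSBA or FQGNBA code, and the function names, the clamping of end values into the first and last intervals, and the data are assumptions.

```python
def interval_index(x, x_min, x_max, nres):
    """Map a continuous feature value to one of nres equal intervals (0-based)."""
    if x_max == x_min:
        return 0
    inc = (x_max - x_min) / nres          # INC_i = (MAX_i - MIN_i) / NRES
    k = int((x - x_min) / inc)
    return min(max(k, 0), nres - 1)       # ASSUMED: x = MAX_i goes in the last interval

def frequency_histogram(values, nres):
    """Count the training values of one feature falling in each interval."""
    x_min, x_max = min(values), max(values)
    counts = [0] * nres
    for x in values:
        counts[interval_index(x, x_min, x_max, nres)] += 1
    return counts

values = [0.0, 0.1, 0.5, 0.9, 1.0]        # hypothetical feature values
print(frequency_histogram(values, 5))     # [2, 0, 1, 0, 2]
```

Even this small example leaves some intervals empty, which is the situation that complicates combining feature probabilities by direct multiplication.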

7.  HISTOGRAMS: NORMALIZED TO THIS SPECTRA MAXIMUM 100 = X

the frequency histogram for the given feature, category has been
normalized so the most highly populated interval is plotted
full-scale; the actual number of patterns in the full-scale interval is X.


8.  HISTOGRAMS: NORMALIZED TO CATEGORY SPECTRA MAXIMUM 100 = Y

the frequency histogram for the given feature, category has been
normalized to the most highly populated interval of the NCAT
category histograms for the given feature. This normalized plot is
presented to allow easy comparison of histograms for all
categories of a given feature.


9.  HISTOGRAMS: SUMMATION OF ALL CATEGORIES

the frequency histogram for the given feature without regard to
possible categories; the over-all feature distribution.


10.  HISTOGRAMS: NORMALIZED TO FEATURE SPECTRA MAXIMUM 100 = Z

the over-all feature frequency histogram has been normalized to
the most highly populated interval of the NVAR over-all feature
histograms. This normalization is presented to allow easy
comparison of histograms for all features.


11.  PROBABILITY CALCULATED WITH SUM OF ...

the Bayes Theorem probability that a given pattern i is a member of
category k, using feature j and its associated probability distribution,
is given by:

$P_j[X_{j,k} \mid x_{i,j}] = \frac{(PROB_k)(RISK_k)\,P[x_{i,j} \mid X_{j,k}]}{\sum_{n=1}^{NCAT} (PROB_n)(RISK_n)\,P[x_{i,j} \mid X_{j,n}]}$

where $P[x_{i,j} \mid X_{j,n}]$ = the value of the probability
distribution for (feature j, category n) at the value of $x_{i,j}$,

given that the categories are mutually exclusive and that the
probability of a pattern being a member of some category of the
training set is 1.0.

The probability distributions as represented by the frequency
histograms are not, unfortunately, continuous (there may be
completely empty intervals surrounded by high-frequency
regions). This makes for difficulties in attempting a
straightforward multiplicative multifeature probability estimate. We
have chosen to combine the single feature probabilities using
less-sensitive (but empirical) rules:

1.  FEATURE PROBABILITIES RAISED TO THE α POWER:

$P_{total}[X_k \mid x_i] = \sum_{j=1}^{NVAR} P_j[X_{j,k} \mid x_{i,j}]^{\alpha}$

2.  LN FEATURE PROBABILITIES (essentially a multiplicative
combination):

$P_{total}[X_k \mid x_i] = \sum_{j=1}^{NVAR} \ln P_j[X_{j,k} \mid x_{i,j}]$
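The single-feature Bayes probability and the two combination rules can be sketched as follows. This is not the PROBBA or PREDBA code. The function names, the treatment of an all-zero denominator, the handling of ln(0) as minus infinity, and the numbers are assumptions made to show how an empty histogram interval affects each rule.

```python
import math

def feature_probabilities(likelihoods, prob, risk):
    """Per-category Bayes probability for one feature.

    likelihoods[k] is the histogram value P[x_ij | X_jk] for category k;
    prob[k] and risk[k] play the roles of PROB_k and RISK_k.
    """
    weighted = [p * r * l for p, r, l in zip(prob, risk, likelihoods)]
    total = sum(weighted)
    if total == 0.0:
        return [0.0] * len(weighted)      # ASSUMED handling of an all-empty case
    return [w / total for w in weighted]

def combine(per_feature, alpha):
    """Sum over features of P^alpha, or of ln P when alpha == 0."""
    ncat = len(per_feature[0])
    scores = []
    for k in range(ncat):
        values = [feature[k] for feature in per_feature]
        if alpha == 0:
            scores.append(sum(math.log(v) if v > 0.0 else -math.inf for v in values))
        else:
            scores.append(sum(v ** alpha for v in values))
    return scores

# Two categories, two features; the second feature falls in an interval
# that is empty for category 0 (hypothetical histogram values).
prob, risk = [1.0, 1.0], [1.0, 1.0]
per_feature = [
    feature_probabilities([0.4, 0.1], prob, risk),   # -> [0.8, 0.2]
    feature_probabilities([0.0, 0.3], prob, risk),   # -> [0.0, 1.0]
]
print(combine(per_feature, 0.5))
print(combine(per_feature, 0))
```

With the logarithmic rule a single empty interval rules a category out entirely, whereas the α-power sum only lowers its score, which is the sense in which the power rules are less sensitive.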
SECTION III:  IMPLEMENTATION
1.  Subroutines: 

BAYES:  driver 

INP1BA:  input 

INP2BA:  input,  risk  and  a  arrays 

SCALBA:  scale  parameters 

OUT1BA:  output,  working  parameters

CLPSBA:  continuous  feature  values  to  discrete  integers 

FQGNBA:  creates  and  stores  frequency  distributions 

HSHQBA:  driver  for  histograms 

HSTWBA:  line- printer  histograms 

PROBBA:  Bayes-rule  probabilities  for  each  category 

PREDBA:  multi-feature  probability  summations  and  out 

OUT2BA:  output,  pattern  classification  results 

OUT3BA:  output,  result  summary 

INACBA:  interactive  terminal  driver 


2.  Organization: 


[Organization chart headed by BAYES; the diagram is not recoverable from the scan.]


APPENDIX B - FORGY/JANCEY PROGRAM -
SQUARED-ERROR PROGRAM


Subroutine  EXEC 


      SUBROUTINE EXEC(X,LIMIT)
C
C  THIS SUBROUTINE READS PARAMETERS, COMPUTES STORAGE AND CALLS MAJOR
C  PROGRAM SEGMENTS NEEDED FOR A NON-HIERARCHICAL CLUSTERING JOB USING
C  ONE OF THE METHODS PROGRAMMED AS A VERSION OF SUBROUTINE *KMEAN*.
C
C  EVERY JOB REQUIRES THREE USER SUPPLIED DECK SEGMENTS.
C
C  1. PROGRAM *DRIVER* PERFORMS THE FOLLOWING TASKS.
C     A. ASSIGNS INPUT/OUTPUT UNITS.
C     B. ESTABLISHES THE DIMENSION OF THE X ARRAY AND SETS THIS
C        DIMENSION TO *LIMIT*
C     C. CALLS SUBROUTINE *EXEC*.
C
C     THE FOLLOWING EXAMPLE WILL SUFFICE IN MOST CASES.

         PROGRAM DRIVER(INPUT,OUTPUT,PUNCH,TAPE5=INPUT,TAPE6=OUTPUT,
        A TAPE7=PUNCH,TAPE1,TAPE2)
         DIMENSION X(5000)
         LIMIT=5000
         CALL EXEC(X,LIMIT)
         END

C  2. SUBROUTINE *USER* IS EMPLOYED TO READ THE COMPLETE SET OF SCORES
C     ON THE VARIABLES FOR ONE DATA UNIT. THE FOLLOWING EXAMPLE
C     ILLUSTRATES VARIOUS POSSIBILITIES FOR MERGING FILES AND
C     TRANSFORMING VARIABLES AS THEY ARE READ.

         SUBROUTINE USER(X)
         DIMENSION X(8)
         READ(1,100) X(7),Y
RE*0(2>  (XII) .l*t.G) 

REA0IS.2P0)  X«B).Z 
XI3>-.S*X<3)  ' 

X<7I*3.«*X(T) 

X (Sla.**X(G).«3S*Y*.2S*Z*XtGI 
RETURN 

IBB  FORMAT I2F| |.3I 
EGG  FORMAT (FB.I.FG.3) 

         END

C  3. FUNCTION *DIST* COMPUTES THE DISTANCE BETWEEN TWO DATA UNITS OR
C     BETWEEN A DATA UNIT AND A CLUSTER CENTROID. THE USER CAN SPECIFY
C     ANY DESIRED DISTANCE FUNCTION AND WEIGHT THE VARIABLES IN ANY
C     MANNER. THE FOLLOWING EXAMPLE ILLUSTRATES A WEIGHTED SQUARED
C     EUCLIDEAN DISTANCE BETWEEN TWO DATA UNITS DENOTED AS X AND Y.
C     THE PROBLEM INVOLVES 5 VARIABLES AND THE WEIGHTS ARE IN THE
C     *W* ARRAY.

         FUNCTION DIST(X,Y)

DIMENSION  XtD.VIII.Wtai 

OATA  IWl!)»l*l*G)/3*l..3.«A.S.t.*t*l«/ 

         DIST=0.
         DO 10 I=1,5
      10 DIST=DIST+W(I)*(X(I)-Y(I))**2
         RETURN


C     NOTE THAT SCALING AND TRANSFORMATION OF VARIABLES CAN BE
C     ACCOMPLISHED EITHER IN SUBROUTINE *USER* OR IN SUBROUTINE *DIST*.
C  INPUT SPECIFICATIONS
C
C  CARD 1  TITLE
C  CARD 2  PARAMETER CARD
C     COLS  1- 5  NE=NUMBER OF ENTITIES (DATA UNITS)
C     COLS  6-10  NV=NUMBER OF VARIABLES
C     COLS 11-15  NC=NUMBER OF CLUSTERS
C     COLS 16-20  NTIN=INPUT UNIT FOR THE DATA SET
C                    NTIN=5, CARD READER
C                    NTIN.NE.5, TAPE OR DISK FILE
C     COLS 21-25  NTOUT=OUTPUT UNIT FOR SAVING CLUSTER MEMBERSHIP LISTS
C                    NTOUT=7, CARD PUNCH
C                    NTOUT.LE.0, DO NOT SAVE MEMBERSHIP LISTS
C     COLS 26-30  MINREL=TERMINATION PARAMETER. CLUSTERING ENDS WHEN A
C                    CYCLE THROUGH THE DATA SET RESULTS IN *MINREL*
C                    OR FEWER CHANGES IN CLUSTER MEMBERSHIPS
C                    MINREL.LE.0, ITERATE TO COMPLETE CONVERGENCE
C     COLS 31-35  IPART=INITIAL PARTITION PARAMETER
C                    IPART=1, SEED POINTS ARE SELECTED FROM THE DATA UNITS.
C                       READ THE SEQUENCE NUMBERS FOR THE CHOSEN DATA
C                       UNITS FROM CARD(S) 3 IN 20I4 FORMAT. IF THE
C                       DATA SET IS NOT STORED IN CORE, THE LIST OF
C                       SEQUENCE NUMBERS MUST BE IN ASCENDING ORDER
C                    IPART=2, THE DATA UNITS ARE GROUPED INTO AN INITIAL
C                       PARTITION IN THE INPUT SEQUENCE WITH THE
C                       FIRST *NUMBR(1)* IN CLUSTER 1, THE NEXT
C                       *NUMBR(2)* IN CLUSTER 2, ETC. READ THE
C                       *NUMBR* ARRAY FROM CARD(S) 3 IN 20I4 FORMAT.
C                    IPART=3, THE SCORE VECTORS FOR THE SEED POINTS ARE
C                       READ FROM CARD(S) 4 IN FORMAT *FMT* WHICH IS
C                       READ FROM CARD 3.
C     COLS 36-40  METHOD=PARAMETER FOR CHOOSING THE ALGORITHM IN ONE
C                    VERSION OF SUBROUTINE *KMEAN*.
C                    METHOD=1, JANCEY ALGORITHM
C                    METHOD.NE.1, FORGY ALGORITHM
C
C***CARDS 3 AND 4 ARE READ IN SUBROUTINE *KMEAN* ACCORDING TO THE
C***PROCEDURE SPECIFIED BY THE CHOSEN VALUE OF *IPART*. NOTE THAT THE
C***BASIC K-MEANS METHOD OF MACQUEEN SIMPLY USES THE FIRST *NC* DATA
C***UNITS AS CLUSTER SEED POINTS AND THEREFORE IGNORES THE *IPART*
C***PARAMETER.
C
C  STORAGE ALLOCATIONS IN THE *X* ARRAY
C
C  X(N1) TO X(N2-1)  NC*NV WORDS--STORAGE OF THE CENTR ARRAY
C  X(N2) TO X(N3-1)  NC WORDS--STORAGE OF THE NUMBR ARRAY
C  X(N3) TO X(N4-1)  NE WORDS--STORAGE OF THE MEMBR ARRAY
C  X(N4) TO X(N5-1)  NC*NV WORDS--STORAGE OF THE TOTAL ARRAY
C  X(N5) TO X(N6)    NV OR NE*NV WORDS--STORAGE OF THE DATA ARRAY
C  X(N4) TO X(N7)    NE WORDS--STORAGE OF THE LIST ARRAY IN *RESULT*
C

      DIMENSION X(1),TITLE(20)
      READ(5,1000) TITLE
      READ(5,1100) NE,NV,NC,NTIN,NTOUT,MINREL,IPART,METHOD
      WRITE(6,2000) TITLE
      WRITE(6,2100) NE,NV,NC,NTIN,NTOUT,MINREL,IPART,METHOD
      N1=1
      N2=N1+NC*NV
      N3=N2+NC
      N4=N3+NE
      N5=N4+NC*NV
C  *N6* MAY BE INCREASED IN *KMEAN*.
      N6=N5+NV-1
      N7=N4+NE-1
      MAX=N6
      IF(N7.GT.MAX) MAX=N7
      WRITE(6,2200) MAX,LIMIT
      IF(MAX.GT.LIMIT) STOP
      CALL KMEAN(X(N1),X(N2),X(N3),X(N4),X(N5),N5,NE,NV,NC,NTIN,MINREL,
     A IPART,METHOD,LIMIT)
      CALL RESULT(X(N1),X(N2),X(N3),X(N4),TITLE,NE,NV,NC,NTOUT)
      RETURN
 1000 FORMAT(20A4)
 1100 FORMAT(8I5)
 2000 FORMAT(1H1,20A4)
 2100 FORMAT(5H0NE =,I5,/,5H NV =,I5,/,5H NC =,I5,/,7H NTIN =,I5,/,
     A8H NTOUT =,I5,/,9H MINREL =,I5,/,8H IPART =,I5,/,9H METHOD =,I5)
 2200 FORMAT(19H0REQUIRED STORAGE =,I5,6H WORDS,/,
     A 19H0ALLOTTED STORAGE =,I5,6H WORDS)
      END
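
The user-supplied distance function described in the EXEC header comments is a weighted squared Euclidean distance. A minimal Python sketch of that calculation follows; the function name `weighted_sq_dist` and the sample weights are hypothetical and are not taken from the listing.

```python
def weighted_sq_dist(x, y, w):
    """Weighted squared Euclidean distance between two data units."""
    if not (len(x) == len(y) == len(w)):
        raise ValueError("x, y and w must have the same length")
    return sum(wi * (xi - yi) ** 2 for xi, yi, wi in zip(x, y, w))


# Three variables; the second is weighted double, the third half.
d = weighted_sq_dist([1.0, 2.0, 3.0], [1.0, 0.0, 5.0], [1.0, 2.0, 0.5])
print(d)  # 10.0
```

Setting every weight to 1.0 reduces this to the ordinary squared Euclidean distance, so variable scaling can be handled either here or when the data are read.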
Subroutine RESULT


      SUBROUTINE RESULT(CENTR,NUMBR,MEMBR,LIST,TITLE,NE,NV,NC,NTOUT)
C  THIS SUBROUTINE PRINTS THE RESULTS FROM A CLUSTERING JOB RUN
C  ON ANY VERSION OF SUBROUTINE *KMEAN*.
C
      DIMENSION CENTR(1),NUMBR(1),MEMBR(1),LIST(1),TITLE(20)
C
C  AS A CONTINGENCY PRECAUTION WRITE OUT THE RAW MEMBERSHIP LIST.
      WRITE(6,2000) TITLE
      WRITE(6,2100) (MEMBR(K),K=1,NE)
      WRITE(6,2200) (NUMBR(J),J=1,NC)
C  INVERT THE *MEMBR* ARRAY AND PUT THE RESULT IN THE *LIST* ARRAY.
C  FIRST REVISE THE *NUMBR* ARRAY TO CONTAIN START POINTS IN THE
C  *LIST* ARRAY FOR EACH CLUSTER
      NUMBR(NC)=NE-NUMBR(NC)+1
      JJ=NC
      JJ1=JJ-1
      DO 10 J=2,NC
      NUMBR(JJ1)=NUMBR(JJ)-NUMBR(JJ1)
      JJ=JJ1
   10 JJ1=JJ-1
C  BUILD *LIST* ARRAY
      DO 20 K=1,NE
      MEMBRK=MEMBR(K)
      NJ=NUMBR(MEMBRK)
      LIST(NJ)=K
      NUMBR(MEMBRK)=NUMBR(MEMBRK)+1
   20 CONTINUE
C  SAVE THE SORTED MEMBERSHIP LIST IF DESIRED
      IF(NTOUT.LE.0) GO TO 30
      WRITE(NTOUT,3000) TITLE
      WRITE(NTOUT,3100) (LIST(K),K=1,NE)
C  RESTORE THE *NUMBR* ARRAY
   30 JJ=NC
      DO 40 J=2,NC
      NUMBR(JJ)=NUMBR(JJ)-NUMBR(JJ-1)
   40 JJ=JJ-1
      NUMBR(1)=NUMBR(1)-1
C  PRINT RESULTS FOR EACH CLUSTER
      WRITE(6,2000) TITLE
      K1=1
      DO 50 J=1,NC
      WRITE(6,2300) J,NUMBR(J)
      J1=(J-1)*NV
      WRITE(6,2400) (CENTR(J1+I),I=1,NV)
      K2=K1+NUMBR(J)-1
      WRITE(6,2500) (LIST(K),K=K1,K2)
      K1=K2+1
   50 CONTINUE
      RETURN
 2000 FORMAT(1H1,20A4)
 2100 FORMAT(20H0RAW MEMBERSHIP LIST,/,(1X,25I5))
 2200 FORMAT(14H0CLUSTER SIZES,/,(1X,25I5))
 2300 FORMAT(8H0CLUSTER,I3,9H CONTAINS,I5,11H DATA UNITS)
 2400 FORMAT(21H0CENTROID COORDINATES,/,(1X,10E12.4))
 2500 FORMAT(16H0MEMBERSHIP LIST,/,(1X,25I5))
 3000 FORMAT(20A4)
 3100 FORMAT(25I3)
      END
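
The inversion step in RESULT turns a per-unit membership array into a list of data units grouped by cluster, using the cluster sizes as running start positions. A minimal Python sketch of the same idea is given below; the function name `invert_membership` is hypothetical, and the sketch keeps separate `start` and `nxt` arrays rather than overwriting and then restoring the size array as the listing does.

```python
def invert_membership(membr, nc):
    """Return (lst, numbr): data-unit numbers grouped by cluster, and cluster sizes.

    membr[k] is the 1-based cluster number of data unit k+1.
    """
    numbr = [0] * nc
    for c in membr:
        numbr[c - 1] += 1
    # Start position of each cluster's block in the output list.
    start = [0] * nc
    for j in range(1, nc):
        start[j] = start[j - 1] + numbr[j - 1]
    lst = [0] * len(membr)
    nxt = start[:]                       # next free slot for each cluster
    for k, c in enumerate(membr, start=1):
        lst[nxt[c - 1]] = k
        nxt[c - 1] += 1
    return lst, numbr


lst, numbr = invert_membership([2, 1, 2, 3, 1], 3)
print(lst, numbr)
```

Within each cluster the data units keep their input order, which is also what the single forward pass in the listing produces.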
Subroutine KMEAN


Version 1

      SUBROUTINE KMEAN(CENTR,NUMBR,MEMBR,TOTAL,DATA,N5,NE,NV,NC,NTIN,
     A MINREL,IPART,METHOD,LIMIT)
C
C-----------------------------------------------------------------
C  VERSION 1.  THE DATA SET IS STORED IN CENTRAL MEMORY.
C-----------------------------------------------------------------
C
C  THIS SUBROUTINE ITERATIVELY SORTS DATA UNITS INTO CLUSTERS
C  USING THE ALGORITHM OF (METHOD.NE.1)
C
C  FORGY, E.W., CLUSTER ANALYSIS OF MULTIVARIATE DATA. EFFICIENCY
C  VERSUS INTERPRETABILITY OF CLASSIFICATIONS. PAPER PRESENTED AT THE
C  BIOMETRIC SOCIETY (WNAR) MEETINGS, RIVERSIDE, CALIFORNIA, JUNE
C  1965. ABSTRACT IN BIOMETRICS, VOLUME 21, NUMBER 3, P 768.
C
C  OR THE ALGORITHM OF (METHOD=1)
C
C  JANCEY, R.C., MULTIDIMENSIONAL GROUP ANALYSIS. AUSTRALIAN JOURNAL
C  OF BOTANY, VOLUME 14, NUMBER 1, APRIL 1966, PP 127-130.
C
C  CENTR(NV*(J-1)+I)=SCORE ON I-TH VARIABLE FOR J-TH CLUSTER CENTROID
C  TOTAL(NV*(J-1)+I)=TOTAL SCORE ON I-TH VARIABLE FOR DATA UNITS THUS
C                    FAR ALLOCATED TO THE J-TH CLUSTER
C  NUMBR(J)=NUMBER OF DATA UNITS THUS FAR ALLOCATED TO THE J-TH CLUSTER
C  MEMBR(K)=CLUSTER TO WHICH THE K-TH DATA UNIT CURRENTLY BELONGS
C  DATA(NV*(K-1)+I)=SCORE ON I-TH VARIABLE FOR K-TH DATA UNIT
C
      DIMENSION CENTR(1),TOTAL(1),NUMBR(1),MEMBR(1),DATA(1),FMT(20)
     A ,NAME(4)
      DATA (NAME(I),I=1,4)/4H   F,4HORGY,4H  JA,4HNCEY/
      I=1
      IF(METHOD.EQ.1) I=3
      WRITE(6,2000) NAME(I),NAME(I+1)
C  CHECK FOR SUFFICIENT STORAGE
      N6=N5+NE*NV-1
      WRITE(6,2100) N6,LIMIT
      IF(N6.GT.LIMIT) STOP
C  ESTABLISH INITIAL PARTITION
      IF(IPART.NE.3) GO TO 20
C  SEED POINTS ARE READ DIRECTLY FROM CARDS
      READ(5,1000) FMT
      WRITE(6,2200) FMT
      WRITE(6,2300)
      J1=0
      DO 10 J=1,NC
      READ(5,FMT) (CENTR(J1+I),I=1,NV)
      WRITE(6,2400) (CENTR(J1+I),I=1,NV)
   10 J1=J1+NV
      GO TO 30
C  IPART=1 OR 2
   20 WRITE(6,2500) IPART
      READ(5,1100) (NUMBR(J),J=1,NC)
      WRITE(6,2600) (NUMBR(J),J=1,NC)
C  READ THE DATA SET INTO CENTRAL MEMORY
   30 K1=1
      DO 40 K=1,NE
      CALL USER(DATA(K1))
   40 K1=K1+NV
      IF(IPART.EQ.3) GO TO 100
C  IF *IPART* IS 1 OR 2 SET UP THE SEED POINTS
      IF(IPART.EQ.2) GO TO 60
C  IPART=1. THE DATA UNIT WITH SEQUENCE NUMBER *NUMBR(J)* IS USED AS
C  THE J-TH SEED POINT
      DO 50 J=1,NC
      NJ=(NUMBR(J)-1)*NV
      J1=(J-1)*NV
      DO 50 I=1,NV
      CENTR(J1+I)=DATA(NJ+I)
   50 CONTINUE
      GO TO 100
C  IPART=2. THE DATA UNITS ARE GROUPED INTO CLUSTERS WITH THE J-TH
C  CLUSTER HAVING *NUMBR(J)* MEMBERS.
   60 K=0
      J1=-NV
C  ACCUMULATE THE TOTAL SCORE ON EACH VARIABLE FOR EACH CLUSTER
      DO 80 J=1,NC
      NJ=NUMBR(J)
      J1=J1+NV
      DO 70 I=1,NV
   70 TOTAL(J1+I)=0.
      DO 80 KJ=1,NJ
      K=K+1
      MEMBR(K)=J
      K1=(K-1)*NV
      DO 80 I=1,NV
      J2=J1+I
      TOTAL(J2)=TOTAL(J2)+DATA(K1+I)
   80 CONTINUE
C  COMPUTE THE CENTROIDS
      J1=0
      DO 90 J=1,NC
      DO 90 I=1,NV
      J1=J1+1
      CENTR(J1)=TOTAL(J1)/NUMBR(J)
   90 CONTINUE
      GO TO 115
C  INITIALIZE ARRAYS
  100 DO 110 K=1,NE
  110 MEMBR(K)=0
  115 NPASS=1
C  BEGINNING OF MAIN LOOP
  120 J1=0
      DO 130 J=1,NC
      NUMBR(J)=0
      DO 130 I=1,NV
      J1=J1+1
  130 TOTAL(J1)=0.
      MOVES=0
      TDIST=0.
C  ALLOCATE EACH DATA UNIT TO THE NEAREST CLUSTER CENTROID
      K1=0
      DO 160 K=1,NE
      K2=K1+1
      J2=1
C  COMPUTE DISTANCE TO FIRST CLUSTER CENTROID
      DREF=DIST(DATA(K2),CENTR(J2))
      JREF=1
C  TEST DISTANCES TO REMAINING CLUSTER CENTROIDS
      DO 140 J=2,NC
      J2=J2+NV
      DTEST=DIST(DATA(K2),CENTR(J2))
      IF(DTEST.GE.DREF) GO TO 140
      DREF=DTEST
      JREF=J
  140 CONTINUE
C  ALLOCATE DATA UNIT TO CLUSTER *JREF*
      NUMBR(JREF)=NUMBR(JREF)+1
      TDIST=TDIST+DREF
      IF(JREF.EQ.MEMBR(K)) GO TO 150
C  THE DATA UNIT CHANGES ITS MEMBERSHIP
      MOVES=MOVES+1
      MEMBR(K)=JREF
  150 J1=(JREF-1)*NV
      DO 160 I=1,NV
      J1=J1+1
      K1=K1+1
      TOTAL(J1)=TOTAL(J1)+DATA(K1)
  160 CONTINUE
C  ALL DATA UNITS ALLOCATED. TEST FOR CONVERGENCE
      WRITE(6,2700) MOVES,NPASS,TDIST
      NPASS=NPASS+1
      JREF=0
      IF(MOVES.GT.MINREL) GO TO 185
      IF(METHOD.NE.1.AND.MOVES.EQ.0) RETURN
      JREF=1
C  COMPUTE TRUE CLUSTER CENTROIDS--FORGY UPDATE
  170 J1=0
      DO 180 J=1,NC
      DO 180 I=1,NV
      J1=J1+1
  180 CENTR(J1)=TOTAL(J1)/NUMBR(J)
      IF(JREF.EQ.1) RETURN
      GO TO 120
  185 IF(METHOD.NE.1) GO TO 170
C  JANCEY UPDATE
  190 J1=0
      DO 200 J=1,NC
      DO 200 I=1,NV
      J1=J1+1
  200 CENTR(J1)=2.*TOTAL(J1)/NUMBR(J)-CENTR(J1)
      GO TO 120
 1000 FORMAT(20A4)
 1100 FORMAT(20I4)
 2000 FORMAT(1H0,2A4,
     A 53H METHOD OF CLUSTER ANALYSIS.  DATA SET STORED IN CORE)
 2100 FORMAT(19H0REQUIRED STORAGE =,I5,6H WORDS,/,
     A 19H0ALLOTTED STORAGE =,I5,6H WORDS)
 2200 FORMAT(7H0FORMAT,20A4)
 2300 FORMAT(43H1INITIAL CLUSTER CENTERS READ IN AS FOLLOWS///)
 2400 FORMAT(1X,10E12.4)
 2500 FORMAT(9H1 IPART =,I2,30H.  NUMBR ARRAY READ AS FOLLOWS///)
 2600 FORMAT(1X,10I7)
 2700 FORMAT(1H0,I5,37H DATA UNITS MOVED ON ITERATION NUMBER,I3,/,
     A38H SUMMED DEVIATIONS ABOUT SEED POINTS =,E16.8)
      END
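
The difference between the two update rules in Version 1 can be sketched in Python as follows. This is a minimal illustration, assuming squared Euclidean distance, list-of-lists data, and iteration until no data unit changes cluster; the names `kmeans_pass` and `cluster` are hypothetical, and keeping the old seed point for an empty cluster is a choice of this sketch rather than something the listing does.

```python
def sq_dist(x, y):
    """Squared Euclidean distance between two score vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def kmeans_pass(data, centers):
    """Assign every data unit to its nearest center; return (members, means)."""
    nc, nv = len(centers), len(centers[0])
    totals = [[0.0] * nv for _ in range(nc)]
    counts = [0] * nc
    members = []
    for x in data:
        # Ties go to the lowest-numbered cluster, as in the listing.
        j = min(range(nc), key=lambda c: sq_dist(x, centers[c]))
        members.append(j)
        counts[j] += 1
        totals[j] = [t + v for t, v in zip(totals[j], x)]
    # Assumption of this sketch: an empty cluster keeps its old seed point.
    means = [[t / counts[j] for t in totals[j]] if counts[j] else centers[j][:]
             for j in range(nc)]
    return members, means


def cluster(data, seeds, jancey=False, max_pass=100):
    """Iterate until memberships stop changing; return (members, centroids)."""
    centers = [s[:] for s in seeds]
    members = None
    for _ in range(max_pass):
        new_members, means = kmeans_pass(data, centers)
        if new_members == members:
            return members, means        # true centroids of the final partition
        members = new_members
        if jancey:
            # Jancey: reflect the old seed point through the new centroid.
            centers = [[2.0 * m - c for m, c in zip(mj, cj)]
                       for mj, cj in zip(means, centers)]
        else:
            # Forgy: the new seed point is the centroid itself.
            centers = means
    return members, means


data = [[0.0], [1.0], [10.0], [11.0]]
forgy = cluster(data, [[0.0], [1.0]])
jancey = cluster(data, [[0.0], [1.0]], jancey=True)
```

On this small example both rules reach the same partition; the Jancey rule simply moves each seed point twice as far per pass.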


Version 2

      SUBROUTINE KMEAN(CENTR,NUMBR,MEMBR,TOTAL,DATA,N5,NE,NV,NC,NTIN,
     A MINREL,IPART,METHOD,LIMIT)
C
C  VERSION 2.  THE DATA SET IS STORED ON A TAPE OR DISK FILE WHICH IS
C  REWOUND AND READ IN ITS ENTIRETY FOR EACH CYCLE.
APPENDIX C - HIERARCHICAL CLUSTERING PROGRAM

HIER

This routine produces a "dendrogram" which describes the hierarchical
clustering (sometimes known as "Q-mode clustering") of the NPAT
training set patterns. The dendrogram connects groups of patterns at
levels of similarity. It may be used to group the patterns into a
given number of clusters as well as to indicate how many clusters
there are at a given similarity level. Similarity is defined from the
interpattern distances.


WORD     DEFAULT    DESCRIPTION

NIN      (IORIG)    Input unit.

IWAIT    0          LE 0  Every pattern is given equal weight in determining
                          the linkage levels, regardless of the size of the
                          group of which it is a member.
                    GT 0  Every group is given equal weight in determining
                          the linkage levels, regardless of how many
                          patterns are contained in the group.

IPULL    0          LE 0  The number of sections in which the dendrogram
                          is printed is determined by the routine.
                    GT 0  The dendrogram is printed in IPULL sections
                          (maximum of 3).


Prerequisites: The distance matrix must be present on NIN. See DIST.

Hints and Cautions: Only the first NPAT patterns will be clustered. Be
sure that you've defined the "training set" to include all patterns you are
interested in clustering. (The algorithm implemented in this program uses
some computational "tricks" to reduce run time. The clusters will be
nearly the same as those formed by truly hierarchical clustering, but the
levels of similarity may differ.)
SECTION II: DEFINITIONS

1. SIMILARITY

$$S_{i,j} = 1.0 - D_{i,j} / \mathrm{DMAX}$$

DMAX = the largest $D_{i,j}$ in the distance matrix


2. EQUAL SAMPLE WEIGHT PAIR-GROUP METHOD OF CLUSTERING

$$S_{\mathrm{new}} = \frac{(\mathrm{NUM}_{1,\mathrm{old}})(S_{1,\mathrm{old}}) + (\mathrm{NUM}_{2,\mathrm{old}})(S_{2,\mathrm{old}})}{\mathrm{NUM}_{1,\mathrm{old}} + \mathrm{NUM}_{2,\mathrm{old}}}$$

$\mathrm{NUM}_{i,\mathrm{old}}$ = number of patterns grouped into the
cluster represented by $S_{i,\mathrm{old}}$

$S_{i,\mathrm{old}}$ = groups chosen to be clustered this cycle


3. EQUAL GROUP WEIGHT PAIR-GROUP METHOD OF CLUSTERING

$$S_{\mathrm{new}} = \frac{S_{1,\mathrm{old}} + S_{2,\mathrm{old}}}{2}$$
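
The definitions above can be illustrated with a short Python sketch. The function names are hypothetical, and the equal-group-weight rule is taken here to be the plain average of the two old similarities, which is an assumption of this sketch.

```python
def similarity_matrix(dist):
    """S[i][j] = 1.0 - D[i][j] / DMAX for a square distance matrix."""
    dmax = max(max(row) for row in dist)
    if dmax == 0:
        # All patterns coincide; treat every pair as fully similar.
        return [[1.0 for _ in row] for row in dist]
    return [[1.0 - d / dmax for d in row] for row in dist]


def merged_similarity(s1, s2, num1=1, num2=1, equal_group_weight=False):
    """Similarity between a newly merged group and some third group.

    s1, s2     -- similarities of the two merged groups to the third group
    num1, num2 -- number of patterns in each merged group
    """
    if equal_group_weight:
        return (s1 + s2) / 2.0                       # every group counts equally
    return (num1 * s1 + num2 * s2) / (num1 + num2)   # every pattern counts equally


sim = similarity_matrix([[0.0, 2.0, 4.0],
                         [2.0, 0.0, 1.0],
                         [4.0, 1.0, 0.0]])
by_sample = merged_similarity(0.8, 0.2, num1=3, num2=1)
by_group = merged_similarity(0.8, 0.2, num1=3, num2=1, equal_group_weight=True)
```

With a three-pattern group and a single pattern, the equal-sample-weight value is pulled toward the larger group, while the equal-group-weight value sits midway between the two.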
SECTION III: IMPLEMENTATION

1. Subroutines:

HIER: driver
INPUHI: input
FILEHI: file initialization
GROUHI: clustering
RECAHI: distance recalculation
DENOHI: dendrogram formation
COORHI: lineprinter coordinates for dendrogram
NAMEHI: pattern identifiers read into arrays
PRINHI: output, dendrogram
OUTPHI: output

2. Organization:

[Organization chart: HIER driving INPU, FILE, GROU, RECA, DENO, COOR, NAME, PRIN, OUTP]


APPENDIX  D  -  MINIMAL  SPANNING  TREE  PROGRAM  - 
GRAPH-THEORETIC  METHOD 


TREE 

This routine generates a minimal spanning tree over the training
set patterns. The spanning tree is then evaluated ("pruned") for
self-consistent clusters of patterns. The algorithm used is that of Prim
and Dijkstra, as implemented by Whitney. The original program was
written by Dr. Rex Page, Department of Computer Sciences, Colorado
State University.


WORD     DEFAULT    DESCRIPTION

NIN      (IORIG)    Input unit.

NPNT     0          LE 0  No action.
                    GT 0  All nodes of the spanning tree are listed
                          as the tree is constructed. If a diagram
                          of the tree is desired, this information is
                          necessary.

NIT      0          LE 0  The spanning tree will be pruned once,
                          with D=3, FACTOR=2, and SPREAD=0.0.
                    GT 0  The spanning tree will be pruned
                          according to user definition of D,
                          FACTOR and SPREAD. (See 1 below)


(1)  Pruning  parameters  .  .  .  Specify  the  evaluation  parameters  with 
the  following  format: 

D,  FACTOR,  SPREADS 

where D = the number of edges allowed between patterns for patterns
to be "nearby" one another.

FACTOR = Factor times the average length of "nearby" edges for edge
to be inconsistent.

SPREAD = Factor times standard deviation of "nearby" edge lengths for
edge to be inconsistent.


Prerequisites:  None 

Hints  and  Cautions:  For  unbiased  clustering,  use  autoscaled  data. 
References:  Harry  C.  Andrews,  INTRODUCTION  TO  MATHEMATICAL 
TECHNIQUES  IN  PATTERN  RECOGNITION,  Wiley-Interscience,  New  York, 
1972. 

SECTION  II:  DEFINITIONS 

1.  NODE 


Pattern. 


2.  NEIGHBORS 


The  patterns  linked  to  the  given  pattern  during  the 
construction  of  the  minimal  spanning  tree. 


3.  DISTANCE 


The  Euclidean  distance  between  the  given  pattern  and 
its  given  neighbor. 


4.  CLUSTER  (N) 


The nth cluster found, searching from "trunk" out, using
the given pruning parameters.
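
For readers unfamiliar with the construction step, the following Python sketch builds a minimal spanning tree with Prim's algorithm using Euclidean distances. It is only an illustration of the general technique: the function name `prim_mst` is hypothetical, it is not a transcription of the TREE program, and the pruning step is not shown.

```python
import math


def prim_mst(points):
    """Return the minimal spanning tree as (parent, child, length) tuples."""
    n = len(points)
    if n == 0:
        return []
    in_tree = [False] * n
    best = [math.inf] * n      # shortest known link from the tree to each node
    parent = [-1] * n
    best[0] = 0.0              # grow the tree from node 0
    edges = []
    for _ in range(n):
        # Pick the node outside the tree that is closest to it.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((parent[u], u, best[u]))
        # Update the shortest links of the remaining nodes.
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v] = d
                    parent[v] = u
    return edges


edges = prim_mst([(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (5.0, 5.0)])
```

In this example the last edge is much longer than the two before it, which is the kind of edge a pruning rule based on "nearby" edge lengths would be expected to flag.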
SECTION III: IMPLEMENTATION

Subroutines:

TREE: driver
INP1TR: input
GROWTR: formation of minimal spanning tree
DISTTR: two-pattern distance calculation
LASTTR: pointer to last-found node
OUT1TR: output, optional intermediate
INP2TR: input, pruning parameters
CLUSTR: prunes tree
CLIMTR: tree search (in conjunction with CLUSTER)
FINDTR: locates node
STORTR: stores cluster
OUT2TR: output, detailed cluster
FEXMTR: puts termination flag into cluster array
OUT3TR: output, compact cluster

Organization: 

APPENDIX E - "THE COMPARISON OF A BAYESIAN CLASSIFIER AND
A K-NEAREST NEIGHBOR STATISTICAL PATTERN
RECOGNITION TECHNIQUE AS APPLIED TO RADAR
GROUND CLUTTER,"

M.S. Thesis - A. A. Fraser
ABSTRACT


This paper presents a comparison of the Bayes and the k-Nearest
Neighbor statistical pattern recognition algorithms. The first half of
this presentation is a detailed analysis of both techniques, together
with a description of the actual algorithms used.

Simulated radar ground clutter information was available for
analysis. A description of the data subject to analysis is also presented.

The error rate of these classification algorithms was the chief
criterion used for the evaluation of performance. The second half of
the paper discusses the various error-evaluating techniques that are
feasible for evaluating the performance of the algorithms. Because
of economics, time considerations, and other factors, the π-method
was chosen to measure error rate.

Results showed that the nonparametric Nearest Neighbor technique
gives a much smaller error rate than the parametric Bayes
technique for the given data type. The results are justified in the
conclusion.
Symbol          Definition

C(S_k | S_i)    The cost of deciding class S_k when S_i is actually present.
D(d/v)          Pr{making decision d given v}
E{ }            Expectation operator
L(x, S_k)       The average loss associated with class S_k given pattern x.
M_k             The number of samples in class k
N( , )          Representation of a normal distribution
Pr              Probability operator
P*              The Bayesian error rate
P*(e/x)         Error associated with classifying pattern x.
(illegible)     The probability that "x" falls within hypersphere S.
(illegible)     Complementary probability of error
Pe[H]           Holdout method error rate
Pe[U]           U-method error rate
Pe[π]           π-method error rate
R(S_k)          Risk associated with deciding class k
S(g)            Pr{misclassification}
(illegible)     Training set
p(v/s)          Pr{v given s}
p(x, S_k)       (definition illegible)
s               Vectors in signal space
v               Vectors in observation space


1.0  Introduction 

Detailed investigation in the area of statistical pattern recognition
was motivated in this study by the need to use mathematical classification
algorithms to characterize ground clutter and noise. The ultimate goal
is to distinguish between the presence and absence of an object in a
background of ground clutter and noise.

In general, statistical pattern recognition enhances the capability
to develop a machine that will imitate man's perceptive ability.
Research towards this end has been carried out in the areas of artificial
intelligence, interactive graphics, computer-aided design, and many
others. There are some well developed theories behind statistical
pattern recognition [1, 5, 9, 11, 12]. They evolved from all the fields
previously mentioned.

Statistical pattern recognition is the study of mathematical
techniques to build machines to aid human perception. Computers are
advantageous in this area because they can handle large sets of data.


Pattern recognition's function can be conceptualized in
three different states or spaces, as indicated in Figure 1.0.1:
pattern space, feature space and classification space.

The physical world is sensed by a transducer, which feeds its
results into pattern space. We may consider the physical world
as an infinite-dimensional space of parameters. The transducer
describes a representation of the physical world in terms
of R scalar values, where R is typically quite large. R is therefore
the dimensionality of pattern space. Because R is quite
large, because transducers are often specified in terms of cost rather
than the needs of pattern recognition itself, and because computer time
is not negligible, it is desirable to reduce the dimensionality of
pattern space while minimizing any loss of information. Reducing
the dimensionality gives a new N dimensional space, known
as feature space, where N << R. Classification space is
a decision space in which one of k classes is selected for a given
sample. It is therefore k dimensional.

Though one may question the necessity of a feature space, it
has been contended by many that the greatest advancement yet
to be made in specific pattern recognition problems will come
when a meaningful pattern-to-feature space transformation
can be determined [5]. This is so because pattern space is always
defined by available data sensors, which are often chosen for
convenience rather than for their discriminatory power. Thus, it
is not unreasonable to conjecture that there may be linear or
highly nonlinear combinations of the convenient parameters of
pattern space which have meaningful classification power.

It is also necessary to observe that parameters that may successfully
discriminate 'p' from 'q' might not be useful in distinguishing
'p' from 'z' [5]. Hence, feature space should be defined
by the inherent discriminatory power of the data that is present
in pattern space.
Figure 1.0.1 Conceptualized Pattern Recognition Problem.
Transducer specifications are certainly of interest, but the
core of this paper is classification. A good classifier is always
desirable. However, an ideal feature extractor would lessen
the need for an optimum classifier, since classification would
become less difficult. In such a case, even a mediocre classifier
would do an excellent job. Conversely, with a poor feature extractor,
we have more need for an optimum classifier.

The problem of classification involves the partitioning of
feature space into regions, one for each category. In general, there
is a need to minimize the probability of error by choosing the most
appropriate arrangement of partitioning. If some errors are more
costly than others, we may wish to reduce the average cost of
making an error. In such a case, the problem becomes one of
statistical decision theory. Classification space is easiest to describe
in the sense that it is k dimensional and it simply contains the
decisions implemented by the classification algorithm. Typically,
the classification algorithms which define this space partition
the N dimensional feature space into disjoint regions, each region
associated with only one class. Figure 1.0.2 illustrates the partition
of some data in such a manner. The separation surfaces are
referred to as hyperplanes in a multidimensional space and are N-1
dimensional. Figure 1.0.2 shows an idealized case.

How well a particular algorithm performs is determined by
its ability to minimize the probability of error for a given set of
data. The rest of this paper will be devoted to the analysis and
comparison of a parametric and a nonparametric statistical pattern
recognition algorithm operating on simulated radar data furnished
by the USAF Rome Air Development Center in New York.

Programs of the algorithms used for this analysis were made
available by Bruce R. Kowalski from the University of Washington,
through Arthur. Arthur is a collection of pattern recognition/
general data analysis Fortran programs designed to operate as
a flexible, expandable and portable system. "Pattern Recognition,"
as embodied in Arthur, is a tool designed to aid in making a
systematic "educated guess" or analysis of multidimensional data when
direct or statistical analysis is not feasible.


1.  1  Fundamentals  of  Statistical  Decision  Theory 

Inherent in radar detection is the problem of parameter estimation.
In order to give some significance to the process as it applies
to this problem, it is necessary that the various signals and spaces
associated with radar detection be defined.

Consider the representation of the class of all possible signals
as vectors 's' in a signal space Ω, where each point in the space
represents a set of signal parameters or feature values. In the case
of radar detection, such features may be amplitude, phase, doppler,
and so on.

In a similar manner, we define a noise and clutter space which
contains points 'n' that describe all possible waveform realizations
of the noise and clutter process within an observation interval.

Next, we define observation space Γ, which contains points 'v'. The
points 'v' represent all possible joint combinations of signal and noise
waveforms. The regularity of each point in this set may be represented
by an a priori probability distribution function p(v/s). This
distribution shows the dependence of waveform 'v' on points in signal
space [15].

Lastly, let us define decision space Δ, whose points 'd'
represent a set of possible decisions in a statistical decision
problem. D(d/v) is used to describe the probability density of each
decision in decision space for every possible point 'v' in observation
space Γ.

The diagram in Figure 1.1.1 shows parameter estimation
as a decision-making process.

Figure 1.1.1 Reception as a Decision Problem.


2.  0  Introduction  to  Parametric  Classification 


Parametric classification refers to the development of a
statistically defined discriminant function in which the underlying
probability density functions are assumed known [5]. The process then
simply involves the estimation of a few critical parameters that will
define the densities and the corresponding discriminant functions.

Classical techniques in the pattern recognition context provide
a basis for studying parametric classification theory, which represents
the most restrictive of the classification techniques with
respect to a priori assumptions on the prototypes and unknown
data.

2.1 Discriminant Functions

Let us assume that we have K pattern classes $S_1, S_2, \ldots, S_K$ with defining prototypes $Y_m^{(k)}$ for each pattern, where k is the pattern class and $m = 1, 2, \ldots, M_k$ counts the patterns in class k. Speaking in the context of pattern recognition, what we need ideally is a function which measures each point in pattern or feature space and assigns to that point a value which indicates its membership in any given class. In pattern recognition, such a function is called a discriminant function; in decision theory, it is called a probability density function [5, 12]. A discriminant function, to be more precise, has the property that it partitions the pattern or feature space into mutually exclusive regions, each corresponding to a particular class [5, 9, 12]. This function is defined so that for all points x within a given region describing $S_k$, there exists a function $g_k(x)$ such that $g_k(x) > g_j(x)$ for all $j \neq k$, or:

    $g_k(x) > g_j(x) \quad \forall x \in S_k \text{ and } \forall j \neq k$    (2.1)

The hyper-surfaces separating $S_k$ and $S_j$ are given by the expression:

    $g_k(x) - g_j(x) = 0$    (2.2)

This is the set of points at which the discriminant functions of classes $S_k$ and $S_j$ are equal. There are K(K-1)/2 such separating hyper-surfaces in a K class problem [5, 11]. Often, though, not all surfaces will be significant, and redundant hyper-surfaces will develop. Figure 2.1.0 presents an example of such a situation. Figure 2.1.1 shows a discriminant function classifier and a possible separating surface for a two dimensional space. It should also be pointed out that adding the same constant to every discriminant function, or applying the same monotonically increasing function to each of them (e.g., the logarithm, or the square when the values are non-negative), leaves the decision surface unchanged [5, 12]. Also, for a two class problem a single discriminant function and a threshold element are sufficient for classification:

    $g(x) = g_1(x) - g_2(x)$    (2.3)

When g(x) is positive, the class chosen is $S_1$, and when it is negative, $S_2$ is chosen. K-1 discriminant functions are needed to separate K classes.

The adjusting of a discriminant function is referred to as training or learning. If the training is based on known statistics, certain parametric techniques are used. But if it is based on an assumed functional form for the discriminant function (e.g., linear, quadratic, etc.), distribution free techniques are used.

Figure 2.1.1 A Typical Classifier. Panels: (a) Classifier (discriminant functions $g_k(x)$ feeding a maximum detector); Decision Surface.

One of the simplest assumed forms for the discriminant function is known as the linear discriminant function. This function has scalar and vector representations as shown in equations 2.4a and 2.4b below:

    $g_k(x) = w_1^k x_1 + w_2^k x_2 + \cdots + w_n^k x_n + w_{n+1}^k$    (2.4a)

or

    $g_k(x) = W_k^t X$    (2.4b)

where $X = (x_1, x_2, \ldots, x_n, 1)^t$ and $W_k = (w_1^k, w_2^k, \ldots, w_n^k, w_{n+1}^k)^t$ are the augmented pattern and weight vectors, respectively.

It should be observed that the scalar term $w_{n+1}^k$ has been added to the discriminant function for the purpose of coordinate translation. This gives the linear discriminant function the capability to pass through the origin of the augmented space when desired. The surface separating classes $S_k$ and $S_j$ is also linear and may be defined as:

    $g_k(x) - g_j(x) = (W_k - W_j)^t X = 0$    (2.5)

A simple classification algorithm which uses a linear discriminant function is known as a minimum distance classifier [5]. As an example of such a classifier, let the average point of the patterns defining a given class be given by:

    $\bar{Y}_k = \frac{1}{M_k} \sum_{m=1}^{M_k} Y_m^{(k)}$    (2.6)

where $M_k$ represents the number of patterns in class $S_k$. Then there exist K such points in pattern space. Let us consider a Euclidean metric for this space and let us assign an unknown point x to that class which has its average value closest to x. The decision rule may then be written as:

    $x \in S_j \text{ if } d(x, \bar{Y}_j) = \min_k d(x, \bar{Y}_k)$    (2.7)

However,

    $d^2(x, \bar{Y}_k) = (x - \bar{Y}_k)^t (x - \bar{Y}_k) = x^t x - 2 x^t \bar{Y}_k + \bar{Y}_k^t \bar{Y}_k$    (2.8)

where x and $\bar{Y}_k$ are column vectors. According to the properties of a discriminant function, we may subtract the term $x^t x$, which is common to all classes, without changing the decision surface [5]. The algorithm calls for minimum distance, whereas a discriminant function is maximized; multiplying the modified squared distance by negative one-half therefore gives a valid discriminant function:

    $g_k(x) = x^t \bar{Y}_k - \frac{1}{2} \bar{Y}_k^t \bar{Y}_k$    (2.8)

In the context of discriminant functions, the elements of $\bar{Y}_k$ become the linear weights and $-\frac{1}{2} \bar{Y}_k^t \bar{Y}_k$ becomes the augmenting term. There exists a set of prototypes $Y_m^{(k)}$ assigned to each class $S_k$. Now, if there exist linear discriminant functions $g_1, \ldots, g_k, \ldots, g_K$ such that $g_k(Y_m^{(k)}) > g_j(Y_m^{(k)})$ for all $m = 1, \ldots, M_k$ and for all $j \neq k$, then the classes are said to be linearly separable.

The next step in sophistication in defining discriminant functions is given by the piecewise-linear functions. In this case, the separating surface no longer defines as well behaved a region in the pattern or feature space [5]. Therefore, a piecewise-linear machine does not possess the more elegant properties of linear machines [9]. A classic example of a piecewise-linear machine is another form of minimum distance classifier [5]. In this case, the distance of an unknown x from class $S_k$ is given by:

    $d(x, S_k) = \min_{m = 1, \ldots, M_k} d(x, Y_m^{(k)})$    (2.9)

The distance being considered is the smallest distance between the unknown x and all patterns of class $S_k$. The decision rule becomes:

    $x \in S_j \text{ if } d(x, S_j) = \min_k d(x, S_k)$    (2.10)

The corresponding discriminant function for such an algorithm becomes:

    $g_k(x) = \max_{m = 1, \ldots, M_k} \{ x^t Y_m^{(k)} - \frac{1}{2} Y_m^{(k)t} Y_m^{(k)} \}$    (2.11)

A surface of this type is displayed in Figure 2.1.2.
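As an illustration of the two minimum distance classifiers described above, the following is a minimal sketch in Python with NumPy. It assumes a Euclidean metric; the function names, the class labels "S1" and "S2", and the prototype values are hypothetical and chosen only for the example. It shows that choosing the nearest class mean (equation 2.7) and maximizing the linear discriminant function built from the class means select the same class, and it also applies the nearest-prototype rule of equations 2.9 and 2.10.

```python
import numpy as np

def class_means(prototypes):
    """prototypes: dict mapping class label -> (M_k, n) array of prototypes."""
    return {k: np.asarray(Y, dtype=float).mean(axis=0) for k, Y in prototypes.items()}

def nearest_mean_label(x, means):
    """Decision rule 2.7: pick the class whose mean is closest to x."""
    return min(means, key=lambda k: np.linalg.norm(x - means[k]))

def linear_discriminant_label(x, means):
    """Equivalent rule: maximize g_k(x) = x^t Ybar_k - 0.5 * Ybar_k^t Ybar_k."""
    return max(means, key=lambda k: x @ means[k] - 0.5 * means[k] @ means[k])

def nearest_prototype_label(x, prototypes):
    """Rules 2.9-2.10: distance to a class is the distance to its closest prototype."""
    def dist_to_class(k):
        Y = np.asarray(prototypes[k], dtype=float)
        return np.min(np.linalg.norm(Y - x, axis=1))
    return min(prototypes, key=dist_to_class)

# Hypothetical prototypes for two classes in a two dimensional pattern space.
prototypes = {
    "S1": np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]),
    "S2": np.array([[4.0, 4.0], [5.0, 4.0], [4.0, 5.0]]),
}
means = class_means(prototypes)
x = np.array([1.0, 1.5])

label_by_distance = nearest_mean_label(x, means)
label_by_discriminant = linear_discriminant_label(x, means)
label_by_prototype = nearest_prototype_label(x, prototypes)
```

For this well separated example all three rules agree; the nearest-prototype rule can differ from the nearest-mean rule when a class is spread over several clusters.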


In order to introduce another step up in sophistication for discriminant functions, it is convenient to introduce at this point the concept of a generalized decision function. It is often seen in the form given by equations 2.12a and 2.12b:

    $d(x) = w_1 f_1(x) + w_2 f_2(x) + \cdots + w_k f_k(x) + w_{k+1}$    (2.12a)

or, more compactly:

    $d(x) = \sum_{i=1}^{k+1} w_i f_i(x)$    (2.12b)

where $\{f_i(x)\}$, i = 1, 2, ..., k, are real single valued functions of pattern x, $f_{k+1}(x) = 1$, and k+1 is the number of terms used in the expansion. The form of equations 2.12a and 2.12b is representative of all discriminant functions [5, 11]. The various kinds of functions may be attained through variation of $\{f_i(x)\}$ and of the number of terms used in the expansion.

Figure 2.1.2 A Piecewise Linear Discriminant Surface.

Let us define a vector $X^*$ whose elements are $f_i(x)$, so that

    $X^* = (f_1(x), f_2(x), \ldots, f_k(x), 1)^t$    (2.13)

Now, using equation 2.13 we may express the generalized discriminant function as shown in equation 2.14:

    $g(x) = W^t X^*$    (2.14)

where $W = (w_1, w_2, \ldots, w_k, w_{k+1})^t$. Note that $X^*$ is simply a k dimensional vector which has been augmented by one as previously discussed. Hence, equation 2.14 represents a linear function relative to the new patterns $X^*$. One advantage of this approach is that discussions of discriminant functions may be restricted to the linear type without any loss of generality.

The next step up in sophistication is achieved when the $\{f_i(x)\}$ are of polynomial form of second degree, or quadratic. In the two dimensional case $x = (x_1, x_2)^t$ and the decision function is of the form:

    $d(x) = w_{11} x_1^2 + w_{12} x_1 x_2 + w_{22} x_2^2 + w_1 x_1 + w_2 x_2 + w_3$    (2.15a)

This may be expressed in terms of $X^*$ in the linear form as:

    $d(x^*) = W^t X^*$    (2.15b)

with $X^* = (x_1^2, x_1 x_2, x_2^2, x_1, x_2, 1)^t$.

The general quadratic form for N-dimensional patterns, which includes all combinations of the components of x that form terms of degree two or less, may be expressed as shown in equation 2.16a:

    $g_k(x) = \sum_{n=1}^{N} (w_{nn}^k x_n^2 + w_n^k x_n) + \sum_{n=1}^{N-1} \sum_{j=n+1}^{N} w_{nj}^k x_n x_j + w_{N+1}^k$    (2.16a)

and in vector form:

    $g_k(x) = x^t A_k x + x^t B_k + w_{N+1}^k$    (2.16b)
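The remark that a quadratic decision function is linear in the transformed pattern can be checked numerically. The sketch below, in Python with NumPy, maps a two dimensional pattern to the six-element vector of squares, cross product, linear terms and the constant one, and evaluates the decision function as a single inner product. The weight values are hypothetical and were chosen so that the decision surface is a circle of radius 2.

```python
import numpy as np

def quadratic_features(x):
    """Map x = (x1, x2) to X* = (x1^2, x1*x2, x2^2, x1, x2, 1)."""
    x1, x2 = x
    return np.array([x1 * x1, x1 * x2, x2 * x2, x1, x2, 1.0])

# Hypothetical weights: d(x) = x1^2 + x2^2 - 4, a circle of radius 2.
W = np.array([1.0, 0.0, 1.0, 0.0, 0.0, -4.0])

def decision(x):
    """Quadratic in x, but linear in X*: d(x) = W^t X*."""
    return float(W @ quadratic_features(x))

inside = decision((0.5, 0.5))    # negative: the point lies inside the circle
outside = decision((3.0, 0.0))   # positive: the point lies outside the circle
```

Any training procedure that adjusts a linear discriminant function can therefore be applied unchanged to the transformed patterns, at the cost of a larger weight vector.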


2.2 Parametric Classification

Let us consider a set of prototypes in n space with a known distribution. Let us also assume that these points came from a multivariate normal distribution, in which case the most we could learn from the data would be contained in its sample mean vector and sample covariance matrix. The sample mean may be thought of as the point which best represents all the data x in the sense that it minimizes the sum of squared errors from all prototypes. The sample covariance matrix gives information on the spread of the data about the mean.

Naturally, if the original assumption about the distribution of the data is incorrect, the statistics are worthless as a description of the samples. In that case, second order statistics would merely be imposing structure on the prototypes rather than revealing their true structure [9]. Figure 2.2.1 displays four different data sets with identical mean and covariance matrix, and yet their actual structures are quite different.

Now, with the understanding that a parametric pattern recognition machine will only be as good as the validity of the assumed underlying densities, regardless of mathematical elegance, let us choose a normal distribution for the analysis of this section, simply for the sake of its relative ease of manipulation.

Statistically, we define the mean of a set of data points as shown in equation 2.17:

    $\mu = E\{x\}$    (2.17)

and in like manner the covariance matrix of equation 2.18:

    $\Phi = E\{(x - \mu)(x - \mu)^t\}$    (2.18)

where $E\{\cdot\}$ represents the expectation operator. The covariance matrix is real, symmetric and positive definite for real processes [5]. It also has an inverse $[\Phi]^{-1}$ and a determinant $|\Phi|$. The N-variate normal distribution may then be written as:

    $p(x) = \frac{1}{(2\pi)^{N/2} |\Phi|^{1/2}} \exp\{-\frac{1}{2}(x - \mu)^t [\Phi]^{-1} (x - \mu)\}$    (2.19)

where $\frac{1}{(2\pi)^{N/2} |\Phi|^{1/2}}$ is a normalization constant which makes the volume bounded by equation 2.19 of unit value. For convenience, p(x) may be rewritten in the form given by equation 2.19b:

    $p(x) = N(\mu, [\Phi])$    (2.19b)

When the exponent of equation 2.19 is constant, the contours of equal probability become hyper-ellipsoidal, as displayed in Figure 2.2.2 [5, 9].


In the context of the subject matter being treated, it is of considerable importance that the conditional density $p(x/S_k)$ be defined. Owing to our knowledge of the correct classification of the known data, we may formulate $p(x/S_k)$ to be of the form given in equation 2.20:

    $p(x/S_k) = \frac{1}{(2\pi)^{N/2} |\Phi_k|^{1/2}} \exp\{-\frac{1}{2}(x - \mu_k)^t [\Phi_k]^{-1} (x - \mu_k)\}$    (2.20)

Here the mean and covariance matrix for each class take on a significant role. Since the first and second order statistics are needed to specify the density, the mean and covariance matrix take on the values shown in equations 2.21a and 2.21b:

    $\mu_k = E\{Y_m^{(k)}\}$    (2.21a)
    $\Phi_k = E\{(Y_m^{(k)} - \mu_k)(Y_m^{(k)} - \mu_k)^t\}$    (2.21b)
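A minimal sketch of how the class statistics and the class-conditional density might be computed from prototypes is given below, in Python with NumPy. The expectations of equations 2.21a and 2.21b are replaced by sample averages over the prototypes of one class (an assumption, using the 1/M normalization), and the logarithm of equation 2.20 is evaluated rather than the density itself to avoid numerical underflow. The function names and the data values are hypothetical.

```python
import numpy as np

def fit_gaussian(Y):
    """Estimate the mean vector and covariance matrix from (M, N) prototypes."""
    Y = np.asarray(Y, dtype=float)
    mu = Y.mean(axis=0)
    centered = Y - mu
    phi = centered.T @ centered / len(Y)   # sample average of (Y - mu)(Y - mu)^t
    return mu, phi

def log_density(x, mu, phi):
    """Logarithm of the N-variate normal density of equation 2.20."""
    n = len(mu)
    diff = np.asarray(x, dtype=float) - mu
    _, logdet = np.linalg.slogdet(phi)          # log of the determinant |phi|
    quad = diff @ np.linalg.solve(phi, diff)    # (x - mu)^t phi^{-1} (x - mu)
    return -0.5 * (n * np.log(2.0 * np.pi) + logdet + quad)

# Hypothetical prototypes of a single class in two dimensions.
Y = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])
mu, phi = fit_gaussian(Y)
at_mean = log_density(mu, mu, phi)
away = log_density(mu + np.array([2.0, -2.0]), mu, phi)
```

The density is largest at the class mean and falls off along hyper-ellipsoidal contours, so `away` is smaller than `at_mean`.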


With the use of the previous information of this section, the discriminant function for the symmetric loss function for the Bayes (Classical) technique, which will be treated in the next section, may now be calculated as:

    $g_k(x) = P(S_k) p(x/S_k)$    (2.22a)

or for analytical convenience:

    $g_k(x) = \log\{P(S_k) p(x/S_k)\}$    (2.22b)

since the log function is monotonically increasing. Simplifying equation 2.22b gives us the following form for $g_k(x)$:

    $g_k(x) = \log P(S_k) - \frac{N}{2} \log 2\pi - \frac{1}{2} \log |\Phi_k| - \frac{1}{2}(x - \mu_k)^t [\Phi_k]^{-1} (x - \mu_k)$    (2.23)

Eliminating the term $-\frac{N}{2} \log 2\pi$, which is common to all such discriminant functions, and expanding the quadratic form, we obtain

    $g_k(x) = -\frac{1}{2} x^t [\Phi_k]^{-1} x + x^t [\Phi_k]^{-1} \mu_k - \frac{1}{2} \mu_k^t [\Phi_k]^{-1} \mu_k + \log P(S_k) - \frac{1}{2} \log |\Phi_k|$    (2.24)

In order to proceed with more arguments on this subject, another simplifying assumption is necessary for mathematical simplicity [5]. Let us assume that the covariance matrix for each class is the same, $\Phi_k = \Phi$, since this is a very common occurrence in deterministic communication systems that are perturbed by white Gaussian noise [5]. In this case, the terms $-\frac{1}{2} x^t [\Phi]^{-1} x$ and $-\frac{1}{2} \log |\Phi|$ become common to all the discriminant functions and hence may be eliminated from equation 2.24. Its new form is presented in equation 2.25:

    $g_k(x) = x^t [\Phi]^{-1} \mu_k - \frac{1}{2} \mu_k^t [\Phi]^{-1} \mu_k + \log\{P(S_k)\}$    (2.25)

where the weight function $W^{(k)}$ and the term which is used for coordinate translation are given as:

    $W^{(k)} = [\Phi]^{-1} \mu_k$
    $w_{N+1}^{(k)} = -\frac{1}{2} \mu_k^t [\Phi]^{-1} \mu_k + \log P(S_k)$

respectively.

2.3 Classical Technique

The diagram shown in Figure 2.3.1 is of some pertinence to this section since it basically summarizes parameter estimation as a decision making process in terms of the various spaces described previously in Section 1.2.

A decision rule (discriminant function) may be interpreted as an operation which maps points from observation space into decision space. With this in mind, it is quite clear that to optimize the decision process one would like to have an optimum decision rule.

In order that we may have some way to evaluate the performance of these decision rules, let us define a cost function $C(S_k/S_i)$, which will be the loss incurred when a sample pattern x belonging to class $S_i$ is misclassified to class $S_k$ [5, 9]. This cost or loss function has the advantage of providing the capability to weight specific recognition errors more heavily than others. In order to make use of this function, it is useful to compute a conditional average loss $L(x, S_k)$ as shown in equation 2.26:

    $L(x, S_k) = \sum_{i=1}^{K} C(S_k/S_i) p(S_i/x)$    (2.26)

The average loss represents the sum of individual losses weighted by their probability of occurrence. If $L(x, S_k)$ is minimized, then our pattern recognition machine is statistically optimized in the Bayes sense and is often referred to as a Bayes machine [5]. In order to minimize losses, this machine must assign pattern x to category $S_k$ when $L(x, S_k) \le L(x, S_i)$ for all i = 1, ..., K. This implies that $L(x, S_k)$ must be calculated for each of the K classes. An apparent discriminant function then becomes

    $g_k(x) = -L(x, S_k)$    (2.27)

However, Bayes rule, given in equation 2.28a,

    $p(S_i/x) = \frac{p(x/S_i) P(S_i)}{p(x)}$    (2.28a)

allows us to rewrite the discriminant function, omitting p(x) since it is common to all terms, as shown in equation 2.28b:

    $\ell(x, S_k) = \sum_{i=1}^{K} C(S_k/S_i) p(x/S_i) P(S_i)$    (2.28b)

We will refer to this as an unconditional average loss, noting that the p(x) statistics are missing. Thus far, the conditional average loss has been taken as a value assigned to each class $S_k$ at some point x in pattern space. If this term is integrated over the entire pattern space we obtain a risk

    $R(S_k) = \int L(x, S_k) p(x) dx$    (2.29)

The Bayes rule is then applied to minimize the risk associated with deciding that a particular class is present. The statistics for p(x) are unknown. However, to minimize the risk, we should minimize it under the worst possible assumption on the distribution of $P(S_k)$, which is the uniform distribution $P(S_k) = 1/K$ for all k = 1, ..., K classes. This principle is known as the worst criterion on a priori statistics [5].


For some further illustrations, let us consider the symmetric loss function:

    $C(S_k/S_i) = 1 - \delta(i - k)$    (2.30)

where $\delta(i - k)$ is the Kronecker delta function. Hence,

    $C(S_k/S_i) = 0$ if $i = k$, and $1$ if $i \neq k$

This basically states that there is zero loss associated with making the correct decision and one unit of loss associated with making a wrong decision. This choice of $C(S_k/S_i)$ represents the designer's personal bias since it could be chosen differently. The Bayes decision rule for this loss function is:

    $\ell(x, S_k) = \sum_{i=1}^{K} (1 - \delta(i - k)) p(x/S_i) P(S_i)$    (2.31)

which may be simplified to

    $\ell(x, S_k) = p(x) - p(x/S_k) P(S_k)$    (2.32)

Now, to minimize $\ell(x, S_k)$, we maximize $p(x/S_k) P(S_k)$. The Bayes decision rule becomes: choose $S_k$ if

    $p(x/S_k) P(S_k) > p(x/S_i) P(S_i)$ for all $i \neq k$    (2.33)

In terms of a likelihood ratio, we have

    $\Lambda = \frac{p(x/S_k)}{p(x/S_i)}$    (2.34)

which simplifies to the choice of category $S_k$ if

    $\Lambda > \frac{P(S_i)}{P(S_k)}$    (2.35)

which is known as the unconditional maximum likelihood decision. An obvious discriminant function is given by equation 2.36:

    $g_k(x) = P(S_k) p(x/S_k)$    (2.36)

or for analytical simplicity

    $g_k(x) = \log[P(S_k) p(x/S_k)]$

The decision surface might also be expressed as seen in equations 2.37a and 2.37b:

    $g_k(x) - g_i(x) = 0$    (2.37a)

or

    $\log\left(\frac{P(S_k) p(x/S_k)}{P(S_i) p(x/S_i)}\right) = 0$    (2.37b)

Figure 2.3.1 shows a block diagram of the representation of a pattern recognition machine for a Bayes classifier.
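The minimum-loss decision of this section can be sketched in a few lines of Python with NumPy. The example evaluates the unconditional average loss of equation 2.28b for every candidate class at one observed pattern and picks the smallest. The likelihood values, the priors and the asymmetric cost matrix are hypothetical numbers chosen for illustration; `cost[k, i]` holds the loss for deciding class k when class i is true.

```python
import numpy as np

def bayes_decision(likelihoods, priors, cost):
    """Return the index k minimizing sum_i C(S_k/S_i) p(x/S_i) P(S_i).

    cost[k, i] is the loss for deciding S_k when S_i is true.
    """
    joint = np.asarray(likelihoods, dtype=float) * np.asarray(priors, dtype=float)
    losses = np.asarray(cost, dtype=float) @ joint   # equation 2.28b for every k
    return int(np.argmin(losses)), losses

likelihoods = [0.30, 0.10, 0.05]   # hypothetical p(x/S_i) at one observed x
priors = [0.2, 0.5, 0.3]           # hypothetical P(S_i)

symmetric = 1.0 - np.eye(3)        # C(S_k/S_i) = 1 - delta(i - k)
k_sym, losses_sym = bayes_decision(likelihoods, priors, symmetric)

# With the symmetric loss the rule reduces to maximizing P(S_k) p(x/S_k).
k_map = int(np.argmax(np.array(likelihoods) * np.array(priors)))

# A hypothetical asymmetric loss that makes missing the third class expensive.
asymmetric = symmetric.copy()
asymmetric[:, 2] *= 10.0
k_asym, _ = bayes_decision(likelihoods, priors, asymmetric)
```

With the symmetric loss the first class is chosen, in agreement with equation 2.33; with the asymmetric loss the decision moves to the third class even though its product of prior and likelihood is the smallest, which illustrates how the cost function weights specific errors.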

2.4 Bayes Algorithm Used

This algorithm is an approximation of the Bayes multivariate classification technique. It produces frequency histograms for each feature over each and all categories. At this point, it is important to keep in mind that the accuracy of the results produced by this algorithm will depend on how well the frequency histograms represent the true underlying distributions of the various features. The algorithm is considered an approximate Bayes classifier because the true underlying distributions are not known and are only being approximated by the frequency histograms [16].

The loss function used here is the symmetric loss function that was previously discussed [16]:

    $C(S_k/S_i) = 1 - \delta(i - k)$

where

    $C(S_k/S_i) = 0$ if $i = k$, and $1$ if $i \neq k$

The program is quite modular and this could easily be changed, but this was not done.

The decision function for the algorithm is given as the summation over all features of the probability that a given pattern belongs to category k as shown below:
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f 


Mx> 


NVAR 
j  =  Zl 


where  a  =  .5,  1  and  2  and  also 


NVAR 

g(x)  =  £  in  (P[x,  k/x-  .]) 

i=  1  J  >  *  A  >  J 


For any further information on this algorithm, see Sub-Appendix E-A.
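The general idea of a histogram-based approximate Bayes classifier can be sketched as follows, in Python with NumPy. This is not the program described in the Sub-Appendix; it is an illustration under stated assumptions: equal-width bins shared by all features, one count added to every bin so that empty bins do not produce the logarithm of zero, and a decision function formed by summing the logarithms of the per-feature relative frequencies. The function names, bin layout and data are hypothetical.

```python
import numpy as np

def fit_histograms(X, labels, bins, value_range):
    """Per-class, per-feature relative-frequency histograms (hypothetical layout)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    edges = np.linspace(value_range[0], value_range[1], bins + 1)
    model = {}
    for k in np.unique(labels):
        Xk = X[labels == k]
        # Add one count per bin so that empty bins do not give log(0).
        counts = np.stack([np.histogram(Xk[:, j], bins=edges)[0] + 1.0
                           for j in range(X.shape[1])])
        model[k] = counts / counts.sum(axis=1, keepdims=True)
    return edges, model

def classify(x, edges, model):
    """Choose the class with the largest sum over features of ln(frequency)."""
    idx = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, len(edges) - 2)
    scores = {k: float(np.sum(np.log(freq[np.arange(len(x)), idx])))
              for k, freq in model.items()}
    return max(scores, key=scores.get), scores

# Hypothetical training patterns with two features and two categories.
X = [[1, 1], [2, 1], [1, 2], [2, 2], [7, 8], [8, 7], [8, 8], [7, 7]]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
edges, model = fit_histograms(X, labels, bins=5, value_range=(0.0, 10.0))

label_low, _ = classify([1.5, 1.0], edges, model)
label_high, _ = classify([7.5, 9.0], edges, model)
```

As the section notes, the quality of such a classifier depends entirely on how well the histograms represent the underlying distributions, so the bin width and the number of training patterns per bin matter a great deal.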
3.0 Nonparametric Classification

Nonparametric techniques in statistical decision models are often resorted to when underlying probability densities are unknown [5]. Nonparametric algorithms are implementable without reference to any specific distribution and are referred to as "distribution free" techniques [9].

Let us consider a set of patterns and their randomly specified classes $(x_1, \theta_1), (x_2, \theta_2), \ldots, (x_n, \theta_n)$, and the problem of classifying some unknown pattern $x_{n+1}$ in observation space in terms of the known n patterns. Assume that a pattern x takes on some value in observation space X and that the $\theta_i$ are random variables which take on values of either zero or one. Let $g(x_{n+1}; (x_1, \theta_1), \ldots, (x_n, \theta_n))$ be some arbitrary estimator defined on $X \times (X \times \Theta)^n$ which assigns to every $x_{n+1}$ an estimate $\hat{\theta}_{n+1}$ = 0 or 1 based on the n training samples. This implies that g partitions the set X into two subsets. Assume once more that $G = \{g_a\}$ is the set of all such decision rules. For example, G may be the set of all k nearest neighbor rules. A major concern is: how does one select the best procedure for the assignment of $\hat{\theta}_{n+1}$?

As a foundation for further remarks on the topic, either of two basic assumptions on the homogeneity of $(x_1, \theta_1), \ldots, (x_n, \theta_n), (x_{n+1}, \theta_{n+1})$ must be made.

    i) $(x_1, \theta_1), \ldots, (x_{n+1}, \theta_{n+1})$ is a collection of n+1 independently and identically distributed random variables. Dependence between $x_i$ and $\theta_i$ is allowed.

    ii) $\tilde{x}_1, \tilde{x}_2, \ldots, \tilde{x}_n, \tilde{x}_{n+1} \in X$ and $\tilde{\theta}_1, \tilde{\theta}_2, \ldots, \tilde{\theta}_n, \tilde{\theta}_{n+1} \in \Theta$ are arbitrary sequences. A permutation $\pi$ of 1, 2, ..., n+1 is chosen at random according to a uniform distribution on the set of (n+1)! permutations. Then an assignment $x_i = \tilde{x}_{\pi(i)}$; $\theta_i = \tilde{\theta}_{\pi(i)}$; i = 1, 2, ..., n+1 is made.

Let us assume that S(g) is the probability of error associated with classifying each of the $x_i$'s, i = 1, 2, ..., n, from the remaining $x_j$'s. For a more precise mathematical representation, let $\sigma$ be a permutation of 1, 2, ..., n. Also, let $\delta(\theta, \hat{\theta})$ = 1 or 0 according to whether $\theta \neq \hat{\theta}$ or $\theta = \hat{\theta}$. We may then define S(g) mathematically as shown in equation 3.1:

    $S(g) = \frac{1}{n!} \sum_{\sigma} \delta[\theta_{\sigma(i)}; g(x_{\sigma(i)}; (x_{\sigma(j)}, \theta_{\sigma(j)}), j = 1, 2, \ldots, n, j \neq i)]$    (3.1)

For any given g, S(g) will be a random variable whose distribution is governed by the distribution of the $(x_i, \theta_i)$'s.

In general, the classification of $x_{n+1}$ will be formulated in the following way. First, a permutation $\sigma$ of 1, 2, ..., n will be chosen according to an equiprobable distribution on the n! permutations. $x_{n+1}$ will then be given the classification shown in equation 3.2:

    $\hat{\theta}_{n+1} = g(x_{n+1}; (x_{\sigma(1)}, \theta_{\sigma(1)}), \ldots, (x_{\sigma(n-1)}, \theta_{\sigma(n-1)}))$    (3.2)

The permutation $\sigma$ is necessary for bringing symmetry to the data so that the order in which the observations take place will not be important. Now, the risk associated with the classification procedure, R(g), may be expressed as shown in equation 3.3,

    $R(g) = \Pr\{\hat{\theta}_{n+1} \neq \theta_{n+1}\}$    (3.3)

where the probability is taken with respect to the distribution of the $(x_i, \theta_i)$'s under either of the previous assumptions as well as the distribution on $\sigma$.

One very important point is the fact that S(g) is an unbiased estimator of the probability of error in using g on $x_{n+1}$, in the sense that

    $R(g) = E\{S(g)\}$    (3.4)

where the expectation is taken over the distribution on $(x_1, \theta_1), \ldots, (x_n, \theta_n)$ and $\sigma$. Now, an optimum classifier in $G = \{g_a\}$ is the one which minimizes $R(g_a)$. However, since for these nonparametric cases we do not know $R(g_a)$, we must choose the classifier which minimizes $S(g_a)$. Notwithstanding the above statement, it is felt that in practical situations this procedure will develop good decision rules.

Let the n samples in the previously defined training set be divided into k disjoint subsets, each containing r samples. Let g be defined on $X \times (X \times \Theta)^{r-1}$. g will then receive scores $S_1(g), S_2(g), \ldots, S_k(g)$ for the errors associated with the various blocks of r patterns. Note that under assumption i, the blocks are independent. $S_i(g)$, i = 1, 2, ..., k, is a set of independently and identically distributed random variables with common mean R(g). Therefore,

    $\bar{S} = \sum_{i=1}^{k} S_i(g)/k$    (3.5)

is an unbiased estimator of R(g) for which the variance approaches zero at a rate O(1/k). Here, for two sequences of numbers $\{a_n\}$, $\{b_n\}$, n = 1, 2, ..., we say that $\{a_n\}$ is $O(b_n)$ (of the order of $b_n$), and we write $a_n = O(b_n)$, if $a_n/b_n$ remains bounded as $n \to \infty$.

The next section will illustrate a set of decision rules which are pertinent to the context of this discussion and this paper.
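The block-averaged error score of equation 3.5 can be sketched as follows, in Python with NumPy. The sketch makes several assumptions that are not taken from the text: the decision rule g is the single nearest neighbor rule with a Euclidean metric, each sample in a block is scored against the other samples of the same block, and the data are synthetic. The function names are hypothetical.

```python
import numpy as np

def leave_one_out_errors(X, theta):
    """Fraction of samples misclassified by the 1-NN rule built on the others."""
    X = np.asarray(X, dtype=float)
    theta = np.asarray(theta)
    errors = 0
    for i in range(len(X)):
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf                      # exclude the sample being scored
        errors += int(theta[np.argmin(d)] != theta[i])
    return errors / len(X)

def block_score(X, theta, k, seed=0):
    """Average of the k block scores S_i(g), as in equation 3.5."""
    X = np.asarray(X, dtype=float)
    theta = np.asarray(theta)
    order = np.random.default_rng(seed).permutation(len(X))
    blocks = np.array_split(order, k)      # k disjoint blocks of (nearly) equal size
    scores = [leave_one_out_errors(X[b], theta[b]) for b in blocks]
    return float(np.mean(scores)), scores

# Synthetic two-class data: two Gaussian clusters in two dimensions.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (20, 2)), rng.normal(6.0, 1.0, (20, 2))])
theta = np.array([0] * 20 + [1] * 20)

s_bar, scores = block_score(X, theta, k=4)
```

The random permutation plays the role described in the text: it removes any dependence on the order in which the samples were recorded before they are divided into blocks.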

3.1 Nearest Neighbor Pattern Classification

For the sake of clarity, let us reassert a few points from the principles of nonparametric pattern classification in order to lay the foundation for the brief principles of the k-Nearest Neighbor rules (k-NN).

The domain of nonparametric statistical pattern recognition is rather restrictive in the sense that an optimal decision rule is unattainable on the basis of the underlying statistics of the data under consideration. This is so because, in cases where the technique is used, the underlying distributions are usually unknown except for what is inferred from the samples. A decision to classify a point x in observation space into category $\theta$ is allowed to depend only on n correctly classified samples $(x_1, \theta_1), (x_2, \theta_2), \ldots, (x_n, \theta_n)$ and a decision procedure which is often by no means a clear cut one. The assumptions of section 3.0 still hold, namely, the classified samples $(x_i, \theta_i)$ are identically and independently distributed according to the distribution of $(x, \theta)$.

On this basis, certain heuristic arguments will be made about decision rules for the k-Nearest Neighbor technique. Based on some given measure of similarity, it is fair to say that patterns which are close together will have the same classification, or they should have fairly similar a posteriori probability distributions on their respective classifications. Thus, to classify a point in observation space, we could base our decision on nearness, which provides the basis for one of the simplest and most commonly used decision procedures, the Nearest Neighbor rule (NN) [14]. The first formulation of these Nearest Neighbor rules was made by Fix and Hodges. Surprisingly enough, although the rule is simple in concept, it has been shown that in the worst case, in the large-sample limit, the k-nearest neighbor rule has a probability of error which is less than twice that of the Bayesian error rate [14, 19].

Now let us consider a set of n patterns $(x_1, \theta_1), \ldots, (x_n, \theta_n)$ where each pattern $x_i$ belongs to category $\theta_i$ and takes on values in a metric space X upon which is defined a metric d. Consider a new observation $(x, \theta)$ where only x is observable and the corresponding category $\theta$ is unknown. Based on the information contained in the set of correctly classified patterns, a point $x'_n \in \{x_1, x_2, \ldots, x_n\}$ is a nearest neighbor of x if

    $\min_{i = 1, 2, \ldots, n} d(x_i, x) = d(x'_n, x)$

This rule will assign x to category $\theta'_n$ if its nearest neighbor is $x'_n$. An apparent error is made when $\theta'_n \neq \theta$.

For the k-Nearest Neighbor rule, as one might expect, x is classified by assigning it the label most frequently observed among the k nearest samples.
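The nearest neighbor and k-nearest neighbor rules just described can be sketched in Python with NumPy. The sketch assumes a Euclidean metric and a simple majority vote, with ties resolved in favor of the label encountered first; the function name and the sample values are hypothetical.

```python
import numpy as np
from collections import Counter

def knn_classify(x, X, theta, k=1):
    """Assign x the label most frequently observed among its k nearest samples."""
    X = np.asarray(X, dtype=float)
    d = np.linalg.norm(X - np.asarray(x, dtype=float), axis=1)
    nearest = np.argsort(d, kind="stable")[:k]   # indices of the k closest samples
    votes = Counter(theta[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical classified samples; the last one is a class-1 outlier
# lying close to the class-0 cluster.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [5.0, 5.0], [5.0, 6.0], [6.0, 5.0], [1.5, 1.5]])
theta = [0, 0, 0, 1, 1, 1, 1]

x = np.array([1.2, 1.2])
label_1nn = knn_classify(x, X, theta, k=1)
label_3nn = knn_classify(x, X, theta, k=3)
```

Here the single nearest neighbor is the outlier, so the 1-NN rule assigns class 1, while the vote among three neighbors assigns class 0. This is the kind of smoothing that a larger k provides when enough samples are available.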
3.2 k-NN Algorithm Used


The choices of k for the k-NN technique used in this paper are k = 1, 3, 7 and 10. There is a rule of thumb which suggests that the choice of k should be at most the number of patterns being used divided by approximately five or ten. The reason for this is that the probability of misclassification tends to decrease as k increases, provided the number of samples is much greater than k. Considering the definition of k-NN, one could see that it would be ambiguous to choose k close to or greater than the number of patterns in a given class [14]. It will become obvious in Section 5.0 that this procedure is an attempt, based on knowledge of a training set, to develop a conditional probability distribution $P(W_i/x)$, where $W_i$ represents class i. We would like each data set to possess a fairly high density of patterns because we want all k nearest neighbors, x', of an unclassified pattern x to be very close, so that $P(W_i/x) \approx P(W_i/x')$. Although large k reduces the probability of error for large sample sizes, we would like to restrict its size so that the chances of x' and x being close to one another are very good [9, 16].

The criterion for nearness is defined on the basis of interpattern distance. To be specific, the Mahalanobis distance similarity function was used. This function normalizes distance in order to make the analysis invariant to displacement and scale changes [9]. For further details on the k-NN technique used, consult the corresponding Sub-Appendix.

The use of other similarity measures could cause the algorithms to perform differently. In general, any non-negative real valued function $d(x_i, x_j)$ that satisfies the following requirements may be considered a distance function [12]:

(a) $d(x_i, x_j) \geq 0$ for all $x_i$ and $x_j$ in Euclidean space;
(b) $d(x_i, x_j) = 0$ if and only if $x_i = x_j$;
(c) $d(x_i, x_j) = d(x_j, x_i)$;
(d) $d(x_i, x_j) \leq d(x_i, x_k) + d(x_k, x_j)$;

where $x_i$, $x_j$ and $x_k$ are any three vectors in Euclidean space. The value specified for $d(x_i, x_j)$ represents the distance between data units $x_i$ and $x_j$.

Table 3.2.1 gives a display of some commonly found distance functions [12].
NAME              FORM

1. Euclidean      $d_2(X_i, X_j) = \left[ \sum_{k=1}^{p} (x_{ki} - x_{kj})^2 \right]^{1/2}$

2. $l_1$ norm     $d_1(X_i, X_j) = \sum_{k=1}^{p} |x_{ki} - x_{kj}|$

3. Sup-norm       $d_\infty(X_i, X_j) = \sup_{k=1,2,\ldots,p} |x_{ki} - x_{kj}|$

4. $l_p$ norm     $d_p(X_i, X_j) = \left[ \sum_{k=1}^{p} |x_{ki} - x_{kj}|^p \right]^{1/p}$

5. Mahalanobis    $D^2(X_i, X_j) = (X_i - X_j)^T W^{-1} (X_i - X_j)$

Table 3.2.1 Some Distance Functions
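
As a concrete reading of Table 3.2.1, the distance functions can be written as small Python functions. This is a minimal sketch under stated assumptions: the function names are hypothetical, vectors are plain tuples, and the matrix `w_inv` passed to `mahalanobis_sq` is assumed to be the already inverted matrix $W^{-1}$ supplied by the caller (the sketch does not estimate or invert $W$).

```python
def lp_distance(x, y, p):
    """l_p norm distance: (sum_k |x_k - y_k|^p)^(1/p)."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def euclidean(x, y):
    """Euclidean distance, the l_p norm with p = 2."""
    return lp_distance(x, y, 2)

def l1_distance(x, y):
    """l_1 norm distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

def sup_distance(x, y):
    """Sup-norm distance: largest absolute coordinate difference."""
    return max(abs(a - b) for a, b in zip(x, y))

def mahalanobis_sq(x, y, w_inv):
    """Squared Mahalanobis distance (x - y)^T W^{-1} (x - y).

    w_inv is the inverse of W given as a list of rows (an assumption of
    this sketch; computing W from data is left to the caller).
    """
    diff = [a - b for a, b in zip(x, y)]
    n = len(diff)
    return sum(diff[r] * w_inv[r][c] * diff[c]
               for r in range(n) for c in range(n))

identity = [[1, 0], [0, 1]]
quarter = [[0.25, 0.0], [0.0, 0.25]]
```

With the identity matrix the squared Mahalanobis distance equals the squared Euclidean distance; scaling `w_inv` rescales the distance, which is the normalization property mentioned in Section 3.2.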
4.0 Data

Simulated radar ground clutter data furnished by William L. Simkins, Jr., of Rome Air Development Center (RADC) was available for analysis. This data was made available as a result of research sponsored by the USAF/RADC Post-Doctoral program under contract No. CCT-SC-0102-937. This set of data consists of 65,536 samples, each of which may be interpreted as a point in a four dimensional space. The four parameters are the x and y coordinates of the region under consideration and a measure of the amplitude and doppler of radar returns from the given x/y coordinates. It will be assumed that varying combinations of these parameters are adequate to describe a sample.

The simulated clutter information is on magnetic tape. The first file contains amplitude data. This amplitude information is displayed in a pseudo-color photo as shown in Figure 4.0.1. There is a color code at the bottom, with the lowest amplitude of zero represented by black and each color representing a different category. The size of the field of the various categories is summarized in Table 4.0.1. The second file contains measurements of the doppler spread of a zero mean signal. This feature is displayed in a pseudo-color photo in Figure 4.0.2. There is also a color code at the bottom of this photo; five of these colors fully describe the Doppler information. Five categories are presented, each of which contains a one-number doppler range as shown in Table 4.0.2.
FIGURE 4.0.1 PSEUDO COLOR PHOTO OF AMPLITUDE DATA

Category      Width of Field/Category (Amp. Range)

1             1
2-12          22
13            13
Total 13      256

Table 4.0.1


Category      Doppler Measure (Hz)

1             0
2             49
3             98
4             196
5             147
Total 5       5 bits of doppler

Table 4.0.2
With the use of our knowledge of the nature of the data, a standard procedure for obtaining the x/y parameters for each sample was developed. Experience with the algorithms used to analyze these data showed that it is impractical to work with all 65 K samples. Figure 4.0.3 displays a plot of execution time for both algorithms versus number of samples. Hence, a representative sample was chosen with the help of the pseudo color photographs.

Viewing the pseudo color photographs as a two dimensional x/y graph, the x and y parameters range in magnitude from zero through 256. The data used for this analysis is bounded by y = 128-135 and x = 1-128.

The parameters used in this analysis were the x and y coordinates and the amplitude of the radar return. These were chosen because samples of real radar data with similar parameters were expected for comparison, which would provide a basis for comparing how the algorithms work with both real and simulated data.
Figure 4.0.3 Execution Time for Both Algorithms Together Versus Number of Samples


5.0 Error Analysis

The purpose of this chapter is to explore the analytical nature of the error bounds of the suboptimal nonparametric pattern recognition classification technique known as the k nearest neighbor. It has been shown that this technique produces an error rate which is greater than the minimum possible P* and has an optimistic upper bound of approximately 2P*. P* is achieved in a practical situation when we have accurate a priori information on the distribution of the data under analysis [9].

Since the Bayes error rate is in fact the optimum, it will be the lower bound for any other technique, including the k-NN. A tight upper bound shall be analytically established for the k-NN technique in order to substantiate expectations of the results of the analysis carried out on samples of the data previously described.

Let us consider a set of points in observation space $X = \{(x_1, \theta_1), (x_2, \theta_2), \ldots, (x_n, \theta_n)\}$, where $\theta_i$ is a random variable representation of the category of pattern $x_i$. $\theta_i$ takes on values of $W_i$, i = 1, ..., C, where C = the number of classes. Also, let $x'_n \in X$ be the nearest neighbor of observation x. Recalling the nearest neighbor rule, we see that it would assign observation x to category $\theta'_n$. Now the chances that $\theta'_n = W_i$ may be represented by a conditional probability function $P(W_i/x'_n)$. Assuming that we have a very large sample ($n \to \infty$), it can be shown that $x'_n$ is close enough to x so that $P(W_i/x'_n) \approx P(W_i/x)$ [9]. Let us define $W_m(x)$ as the category which optimizes the distribution such that:

$$P(W_m/x) = \max_i P(W_i/x) \qquad (5.1)$$

By definition an optimum decision rule is one which selects $W_m$ in all cases. The minimum error associated with classifying an observation x may then be expressed as:

$$P^*(e/x) = 1 - P(W_m/x) \qquad (5.2)$$

and hence the minimum unconditional average probability of error over observation space may then be expressed as:

$$P^* = \int P^*(e/x)\, p(x)\, dx \qquad (5.3)$$

There will be fluctuations in the error rate for different sets of n samples. This will certainly be the case since, for each sample set used in the classification of observation x, there will be fluctuations in the nearest neighbor vector $x'_n$. This implies a joint dependence of the n sample error rate, $P_n(e/x, x'_n)$, on both x and $x'_n$. Averaging over $x'_n$ yields:

$$P_n(e/x) = \int P_n(e/x, x'_n)\, p(x'_n/x)\, dx'_n \qquad (5.4)$$

With the previous assumptions on the sample size and the fact that $x'_n$ is the nearest neighbor of x, it is intuitively appealing to choose $p(x'_n/x)$ to be a delta function centered about x, which is, in fact, not a bad assumption [5, 9]. Suppose the probability that any sample falls within a hypersphere S centered about x is some positive number $P_s$. Then the chance that all of the n independently drawn samples fall outside the hypersphere may be represented as $(1 - P_s)^n$, which approaches zero as n approaches infinity.

Recalling assumption i of Section 3.0 and considering the fact that it still holds true, the complementary conditional probability of error may be written as shown in equation 5.5.

$$P(\theta, \theta'_n / x, x'_n) = P(\theta/x)\, P(\theta'_n/x'_n) \qquad (5.5)$$

Since x and $x'_n$ are nearest neighbors, no error is made when $\theta = \theta'_n = W_i$, so that

$$P_n(e/x, x'_n) = 1 - \sum_{i=1}^{C} P(W_i/x)\, P(W_i/x'_n) \qquad (5.6)$$

In order to obtain an expression for $P_n(e)$, equation 5.6 is substituted into equation 5.4 and this expression is averaged over x. Recall that n approaches infinity and $p(x'_n/x)$ approaches a delta function. If $P(W_i/x)$ is continuous at x, the equation simplifies to [9]:

$$\lim_{n \to \infty} P_n(e/x) = \int \left[ 1 - \sum_{i=1}^{C} P(W_i/x)\, P(W_i/x'_n) \right] \delta(x'_n - x)\, dx'_n = 1 - \sum_{i=1}^{C} P^2(W_i/x) \qquad (5.7)$$

The asymptotic nearest neighbor error rate may be developed as shown in equation 5.8.

$$P = \lim_{n \to \infty} P_n(e) = \int \left[ 1 - \sum_{i=1}^{C} P^2(W_i/x) \right] p(x)\, dx \qquad (5.8)$$

P* is a lower bound for the error rate of the nearest neighbor technique. Furthermore, it is also fair to say it is a tight lower bound, since there is always a set of conditional and a priori probabilities for which P* is attainable. Therefore, the problem now lies in the determination of a tight upper bound for P.

In order to find an upper limit on P, we must determine how small $\sum_{i=1}^{C} P^2(W_i/x)$ of equation 5.8 can be for a given $P(W_m/x)$. This summation may be minimized subject to the following:

(1) $P(W_i/x) \geq 0$

(2) $\sum_{i \neq m} P(W_i/x) = 1 - P(W_m/x) = P^*(e/x)$

$\sum_{i=1}^{C} P^2(W_i/x)$ will be minimized if we choose each $P(W_i/x)$, $i \neq m$, equal to one another. This implies:

$$P(W_i/x) = \begin{cases} \dfrac{P^*(e/x)}{C - 1}, & i \neq m \\ 1 - P^*(e/x), & i = m \end{cases} \qquad (5.9)$$

Hence

$$\sum_{i=1}^{C} P^2(W_i/x) \geq (1 - P^*(e/x))^2 + \frac{P^{*2}(e/x)}{C - 1}$$

and

$$1 - \sum_{i=1}^{C} P^2(W_i/x) \leq 2P^*(e/x) - \frac{C}{C - 1} P^{*2}(e/x) \qquad (5.10)$$

which shows that $P \leq 2P^*$.

The previous developments show that the nearest neighbor error rate is roughly bounded by the minimum possible error rate P* (Bayes error rate) and 2P*, expressed mathematically as

$$P^* \leq P \leq P^* \left( 2 - \frac{C}{C - 1} P^* \right) \qquad (5.11a)$$
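
The conditional quantities in equations 5.2, 5.7 and 5.10 are easy to check numerically at a single point x. The following is a hedged sketch, not part of the original analysis: the function names and the two example posterior vectors are assumptions chosen for illustration.

```python
def bayes_conditional_error(posteriors):
    """P*(e/x) = 1 - max_i P(W_i/x), as in equation 5.2."""
    return 1.0 - max(posteriors)

def nn_conditional_error(posteriors):
    """Asymptotic 1-NN error 1 - sum_i P^2(W_i/x), as in equation 5.7."""
    return 1.0 - sum(p * p for p in posteriors)

def nn_upper_bound(p_star, c):
    """Right-hand side of equation 5.10: 2P* - (C/(C-1)) P*^2."""
    return p_star * (2.0 - c / (c - 1.0) * p_star)

# Hypothetical posteriors P(W_i/x) for a three class problem (C = 3).
unequal = [0.7, 0.2, 0.1]    # remaining mass split unevenly
equal = [0.7, 0.15, 0.15]    # remaining mass split evenly

p_star = bayes_conditional_error(unequal)   # same value for both vectors
nn_unequal = nn_conditional_error(unequal)
nn_equal = nn_conditional_error(equal)
bound = nn_upper_bound(p_star, 3)
```

Splitting the remaining probability evenly among the other classes makes the nearest neighbor error reach the bound, which is the minimizing choice used in equation 5.9; any uneven split gives a smaller error.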

In order to provide some insight into the error bounds of the other nearest neighbor rules under consideration (k = 3, 7 and 10), the error bounds of the k-Nearest Neighbor rule will now be considered for cases in which k is greater than one. This rule classifies an observation x by assigning it the label most frequently represented among the k nearest samples [9, 16].

Some basic principles from the nearest neighbor rule still hold for the k-Nearest Neighbor classification scheme. Assuming that k is fixed and that the number of samples is allowed to approach infinity, all of the k nearest neighbors will converge to x as discussed in Section 5.0 (p. 46) [9]. The labels of the k nearest neighbors are random variables which independently assume the value $W_i$ with probability $P(W_i/x)$, i = 1, 2, implying a two class problem. The k nearest neighbor rule will select $W_m$ if a majority of the k nearest neighbors are labeled $W_m$. The probability of such an occurrence may be expressed as:

$$\sum_{i=(k+1)/2}^{k} \binom{k}{i} P(W_m/x)^i \, [1 - P(W_m/x)]^{k-i}$$

In general, as k increases so does the chance that $W_m$ is selected. With the same arguments that were used in the first nearest neighbor case, it can be shown that if k is odd, the upper bound of the error rate in a two class problem for the k nearest neighbor rule is given by $C_k(P^*)$, where $C_k$ is defined to be the smallest concave function of P* greater than

$$\sum_{i=0}^{(k-1)/2} \binom{k}{i} \left[ (P^*)^{i+1} (1 - P^*)^{k-i} + (P^*)^{k-i} (1 - P^*)^{i+1} \right] \qquad (5.11b)$$

With the evaluation of $C_k(P^*)$, the bounds of the k nearest neighbor error rate are observed to be as shown in Figure 5.0.1. Note that as k approaches infinity, the upper bound on the k nearest neighbor error rate converges to P*.
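
The majority-vote probability given above can be evaluated directly to see how the chance of selecting $W_m$ grows with k. This is an illustrative sketch only: the function name `prob_select_wm` and the example value $P(W_m/x) = 0.7$ are assumptions, and the formula is applied for odd k in a two class problem, as in the text.

```python
from math import comb

def prob_select_wm(p_m, k):
    """Probability that a majority of k independent labels equal W_m.

    p_m is P(W_m/x); k is assumed odd so that no tie can occur.
    """
    return sum(comb(k, i) * p_m ** i * (1.0 - p_m) ** (k - i)
               for i in range((k + 1) // 2, k + 1))

# Hypothetical posterior of the most probable class at some point x.
p_m = 0.7
p1 = prob_select_wm(p_m, 1)
p3 = prob_select_wm(p_m, 3)
p7 = prob_select_wm(p_m, 7)
```

For k = 1 the probability is just $P(W_m/x)$; for $P(W_m/x) > 1/2$ it increases with k, which is the behavior the error bound argument relies on.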

5.1 Methods for Evaluating the Probability of Misclassification

At this point the most important development yet to be made is a reliable technique for estimating the performance of the Bayes and k-NN algorithms. Ideally, what we would desire to have at this point is the actual probability of error $P_e$. $P_e$ obviously cannot be obtained because we do not have accurate information on the underlying distribution, which is a result of the fact that we only have a finite number of samples to work with [7, 13]. Let $\hat{P}_e$ be the best estimate of the probability of error $P_e$ which may be obtained when one has an infinite sample size and uses one-half to train and the other half to test the given classifier. $\hat{P}_e$ also cannot be obtained because, by definition, all the sample patterns will be used to train the classifiers and none will be left for testing them. In the next section, some of the more important methods which have been developed and experimentally compared will be discussed. The particular method used in the analysis for this paper will be discussed and substantiated.

5.2 Error Estimation Techniques

Throughout this discussion, let $\{x\} = \{x_1, x_2, \ldots, x_N\}$ be the set of pattern samples at our disposal. In other words, $\{x\}$ contains N patterns.

The first error measuring technique to be considered here is called the R-method. Its resultant error rate is denoted as $P_e[R]$. R in this context stands for redistribution and its procedure is given in the following steps:

(i) The classifier is trained on $\{x\}$.

(ii) The classifier is tested on $\{x\}$.

This technique was developed during the early stages of pattern recognition, but was more or less put aside in light of its inadequacies and a developing interest in learning machines with generalization capability. This new interest gave rise to the second method to be considered in this section.


This method is called the H-method or Holdout method and its resultant probability of error is denoted as $P_e[H]$. Typically in this procedure one-half of the available samples are used for training and the other half for testing the classifiers. This method may be accomplished through the following steps:

(i) Partition $\{x\}$ into two mutually exclusive sets $\{x\}_\alpha$ and $\{x\}_\beta$ such that

$$\{x\}_\alpha = \{x_1, x_2, \ldots, x_{N(\alpha)}\}$$

$$\{x\}_\beta = \{x_{N(\alpha)+1}, x_{N(\alpha)+2}, \ldots, x_N\}$$

where $N(\alpha) = N/2$.

(ii) Train the classifier on $\{x\}_\alpha$.

(iii) Test the classifier on $\{x\}_\beta$.

Although it is commonly the case, $N(\alpha)$ does not have to be N/2. In fact, in 1962, W. H. Highleyman presented a paper in which he showed how the set $\{x\}$ may be partitioned for various values of N. However, it has been shown by other researchers that his analysis is only valid for very large N when, in fact, the problem of estimating $P_e$ is mostly concerned with errors associated with small values of N. In any case, with rather frequent usage of this technique, frequent observations of discrepancies between $P_e[R]$ and $P_e[H]$ were reported. In general, observation showed that $\Delta P_e(H - R) = P_e[H] - P_e[R]$ is always positive and inversely proportional to the data size N. As it turns out, $P_e[R]$ is an overoptimistic estimate of $P_e$ and $P_e[H]$ is a pessimistic estimate of $P_e$, where

$$P_e[R] \leq P_e \leq P_e[H] \qquad (5.12)$$

The H-method was further developed to increase its data handling efficiency [7]. The data set in this case is divided into mutually exclusive pairs, and $P_e[H]$ for each is calculated. The expectation operation is then applied to the set, which results in $E\{P_e[H]\}$.
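
The contrast between the R-method and the H-method can be seen on a toy problem. The sketch below is an assumption-laden illustration, not the experiment of this report: it uses a 1-NN classifier with the Euclidean metric, hypothetical one dimensional samples, and hypothetical function names.

```python
import math

def nn_label(x, train):
    """Label of the training pattern nearest to x (1-NN, Euclidean)."""
    return min(train, key=lambda p: math.dist(p[0], x))[1]

def error_rate(train, test):
    """Fraction of test patterns misclassified by 1-NN trained on train."""
    wrong = sum(1 for x, label in test if nn_label(x, train) != label)
    return wrong / len(test)

def r_method(samples):
    """R-method: train and test on the same set {x}."""
    return error_rate(samples, samples)

def h_method(samples):
    """H-method: train on the first half, test on the second half."""
    half = len(samples) // 2
    return error_rate(samples[:half], samples[half:])

# Hypothetical one dimensional samples with overlapping classes.
samples = [((0.0,), "A"), ((1.0,), "B"), ((0.2,), "A"), ((0.8,), "B"),
           ((0.6,), "A"), ((0.4,), "B"), ((0.1,), "A"), ((0.9,), "B")]

pe_r = r_method(samples)
pe_h = h_method(samples)
```

With 1-NN every training pattern is its own nearest neighbor, so the R-method reports no errors at all on distinct samples, while the holdout estimate exposes the overlap between the classes.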

This brings us to yet another method for estimating $P_e$ [6]. This procedure is called the U-method and its error rate is denoted as $P_e[U]$. The method may be accomplished through the following steps:

(i) Take one pattern sample $x_i$ from $\{x\}$. Then $\{x\}_i = \{x_1, x_2, \ldots, x_{i-1}, x_{i+1}, \ldots, x_N\}$.

(ii) Train the classifier on $\{x\}_i$.

(iii) Test the classifier on $x_i$. If $x_i$ is correctly classified, set $n_i = 0$; otherwise set $n_i = 1$, where $n_i$ acts as an error indicator.

(iv) Do steps i, ii, iii for i = 1, ..., N to obtain values for $n_i$, i = 1, ..., N.

(v) Estimate $P_e[U]$ as:

$$P_e[U] = \frac{1}{N} \sum_{i=1}^{N} n_i \qquad (5.13)$$

This procedure is also known as the "leaving one out" method and it may be considered the most efficient error estimation technique since it maximizes the information achievable from the data. In spite of its efficient use of the data, the U-method has one obvious disadvantage. This lies in the fact that for the evaluation of $P_e[U]$ we need as many runs as we have samples, and this might be quite costly in terms of time and money when we are considering a large sample. As a result of this disadvantage, a procedure was proposed by G. T. Toussaint which reduces the number of runs necessary and produces an error rate which is an unbiased estimator of $P_e[U]$.
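
Steps (i) through (v) of the U-method translate almost line for line into code. The sketch below is illustrative only: the 1-NN classifier, the Euclidean metric, the function names and the toy samples are assumptions, not the classifiers or data of this report.

```python
import math

def nn_label(x, train):
    """Label of the training pattern nearest to x (1-NN, Euclidean)."""
    return min(train, key=lambda p: math.dist(p[0], x))[1]

def u_method(samples):
    """Leaving-one-out estimate P_e[U] = (1/N) * sum_i n_i."""
    total = 0
    for i, (x_i, label_i) in enumerate(samples):
        # Step (i): remove x_i; step (ii): "train" on what remains.
        train = samples[:i] + samples[i + 1:]
        # Step (iii): n_i = 0 if x_i is classified correctly, else 1.
        n_i = 0 if nn_label(x_i, train) == label_i else 1
        total += n_i
    # Step (v): average the error indicators over all N runs.
    return total / len(samples)

# Hypothetical one dimensional samples with overlapping classes.
samples = [((0.0,), "A"), ((1.0,), "B"), ((0.2,), "A"), ((0.8,), "B"),
           ((0.6,), "A"), ((0.4,), "B"), ((0.1,), "A"), ((0.9,), "B")]

pe_u = u_method(samples)
```

The loop makes the cost visible: one training run per sample, which is the disadvantage noted above for large N.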

This compromise procedure is known as the rotation or π-method and the steps necessary to implement this procedure are as follows:

(i) Take a small subset of pattern samples $\{x\}_i^{TS} = \{x_1, x_2, \ldots, x_P\}$ such that $1 \leq P \leq N$, N/P is an integer, and $P/N \leq 1/2$. Then $\{x\}_i^{TR} = \{x_1, x_2, \ldots, x_{N-P}\}$, the remaining N - P samples.

(ii) Train the classifier on $\{x\}_i^{TR}$.

(iii) Test the classifier on $\{x\}_i^{TS}$ to obtain an estimate of the error probability denoted by $P_e[\pi]_i$.

(iv) Do steps i, ii, iii for i = 1, 2, ..., N/P such that $\{x\}_i^{TS}$ and $\{x\}_j^{TS}$ are disjoint for $i \neq j$; i, j = 1, ..., N/P.

(v) The resulting estimate of $P_e$ is computed as:

$$E\{P_e[\pi]\} = \frac{P}{N} \sum_{i=1}^{N/P} P_e[\pi]_i \qquad (5.14)$$
One interesting observation is that when P = 1 the π-method becomes the U-method, and when P = N/2 the π-method becomes the H-method. This more or less shows that the π-method is a compromise between the U and H methods.
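
The rotation procedure and its two limiting cases can be sketched as follows. As before, this is a hedged illustration: the 1-NN classifier, the Euclidean metric, the function names and the toy samples are assumptions, and the test subsets are taken as consecutive blocks of P samples.

```python
import math

def nn_label(x, train):
    """Label of the training pattern nearest to x (1-NN, Euclidean)."""
    return min(train, key=lambda p: math.dist(p[0], x))[1]

def error_rate(train, test):
    """Fraction of test patterns misclassified by 1-NN trained on train."""
    wrong = sum(1 for x, label in test if nn_label(x, train) != label)
    return wrong / len(test)

def pi_method(samples, p):
    """Rotation estimate: (P/N) * sum of P_e[pi]_i over N/P disjoint runs."""
    n = len(samples)
    if n % p != 0 or not 1 <= p <= n // 2:
        raise ValueError("need N/P integer and 1 <= P <= N/2")
    rates = []
    for start in range(0, n, p):
        test = samples[start:start + p]                 # {x}_i^TS
        train = samples[:start] + samples[start + p:]   # {x}_i^TR
        rates.append(error_rate(train, test))
    return sum(rates) / len(rates)

# Hypothetical one dimensional samples with overlapping classes.
samples = [((0.0,), "A"), ((1.0,), "B"), ((0.2,), "A"), ((0.8,), "B"),
           ((0.6,), "A"), ((0.4,), "B"), ((0.1,), "A"), ((0.9,), "B")]

pe_pi_1 = pi_method(samples, 1)      # P = 1: same runs as the U-method
pe_pi_half = pi_method(samples, 4)   # P = N/2: both holdout halves averaged
```

Choosing P between these extremes trades the number of training runs (N/P) against the amount of data withheld from each run.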


The result of the exposition on error estimation may be summarized by the following set of inequalities:

$$E\{P_e[R]\} \leq P_e \leq E\{P_e[U]\} \leq E\{P_e[\pi]\} \leq E\{P_e[H]\} \qquad (5.15)$$

5.3 Performance Measure Used

As a result of experience from preliminary runs, in which the optimistic redistribution (R-method) error rate was observed to fluctuate with sample size, and of the previous analysis, which showed that the error rate will fluctuate with sample size, it was decided to estimate the performance of the classifiers under consideration using five different sample sizes. The specific sample sizes decided upon were 60, 135, 255, 510 and 1005 samples. This choice of data sizes was more or less arbitrary, but was made with the thought that there would be a greater fluctuation of the error rate for smaller sample sizes. So, the progression of the data size was chosen so that each size is approximately twice the previous one. The classifiers to be tested with these samples are the k-Nearest Neighbor and Bayes techniques with their various variations. The results of some of the preliminary runs which influenced the decision about the choice of data size are presented in Table 5.3.1.
Bayes Algorithm, 1024 Samples
% Correct / Discriminant Function

Ln(p)     p^(.5)    p^(1)     p^(2)
28.91     25.98     50.68     74.32

(a)


k-NN Algorithm, 1024 Samples
% Correct / Discriminant Function

1-NN    3-NN    4-NN    5-NN    6-NN    7-NN    8-NN    9-NN    10-NN
100     99.71   99.51   99.22   99.02   98.93   98.83   98.83   98.63

(b)


Bayes Algorithm, 257 Samples
% Correct / Discriminant Function

Ln(x)     p^(.5)    p^(1)     p^(2)

The most accurate available method for evaluating the error rates is the U-method. Its application to this experiment, however, would be quite impractical considering the costs that would be involved. Over 1,985 runs would be required per test. Considering the time and cost per run, this technique had to be eliminated. The best available alternative is the π-method, and it was in fact the one that was chosen. As discussed previously, its data handling capability is less efficient than that of the U-method; however, its error rate $P_e[\pi]$ is an unbiased estimator of $P_e[U]$ and the number of runs required can be far less than that of the U-method. Recall equation 5.15,

$$E\{P_e[R]\} \leq P_e \leq E\{P_e[U]\} \leq E\{P_e[\pi]\} \leq E\{P_e[H]\}$$

and let N = the number of samples and P = the number of test samples per run. Table 5.3.2 gives the layout of each group of data used in the experiment. In each case, P was chosen such that the ratio P/N remains the same. This was done so that the efficiency of the handling of the data would stay the same.


Now, for each data set in Table 5.3.2, the error rate was determined according to equation 5.14,

$$E\{P_e[\pi]\} = \frac{1}{15} \sum_{i=1}^{15} P_e[\pi]_i,$$

for each variation of the two classifiers, and is presented in Table 5.3.3. Figure 5.2.1 presents a graphical representation of these error rates. For further details on the results, see Sub-Appendix E-C.


                    E{P_e[π]} for Bayes                 P_e for KNN

Samples   Run #   Ln      p^(.5)   p^(1)    p^(2)    1st NN   3rd NN   7th NN   10th NN

60        1       .8500   .8333    .8833    .8833    .5166    .6333    .6667    .7667
135       2       .7111   .4518    .4518    .4741    .0444    .0889    .0963    .1111
255       3       .6823   .3059    .2980    .3298    .0706    .0745    .0549    .0549
510       4       .4059   .3392    .4384    .4753    .0429    .0427    .0412    .0510
1005      5       .2886   .2336    .3537    .4000    .0358    .0328    .0384    .0403
          6-15    (no entries)

Table 5.3.3 Error Rates Determined From Tests (x, y, Amplitude)
Contrary to the previous analysis, the k-NN classifier gave far better results than the Bayes technique. There will be further explanation of these results in the final chapter.
FIGURE 5.2.1 ERROR CURVES
(horizontal axis: NUMBER OF SAMPLES, 200 to 1000)


6.0 Summary

Parametric classification refers to the development of a statistically defined discriminant function in which the underlying probability density functions are assumed known [5]. The Bayes rule minimizes the error associated with deciding that a given class is present given an unknown sample x. The accuracy of the results given by the Bayes algorithm is dependent on how representative the frequency histograms are of the true underlying distribution of the data under consideration [9].

For an infinite data size, the larger k is, the more accurate will be the results produced by the k-NN classifier [9, 14]. For a finite data size, choosing k too large could give poor results. A rule of thumb states that k should be at most the number of patterns in the smallest category divided by five or ten [10]. Nonparametric statistical pattern recognition is rather restrictive in the sense that an optimal decision rule is unattainable, since the underlying statistics of the data under consideration are unknown. Where P represents the k-NN error rate, its performance is bounded from below by the Bayes error probability and from above by at most twice the Bayes error probability [9]. Note, also, that $P \leq 2P^*$ is only an upper bound for the k-NN rule when k = 1. In fact, for an infinite data size, as k approaches infinity both the upper and lower bounds of the k nearest neighbor error rate converge to the Bayesian error rate P*.

The performance of the various error estimation techniques considered may be summarized by the following inequality:

$$E\{P_e[R]\} \leq P_e \leq E\{P_e[U]\} \leq E\{P_e[\pi]\} \leq E\{P_e[H]\} \qquad (6.1)$$

The data used for the analysis of these algorithms is simulated radar ground clutter information. This should pose a very interesting test for the algorithms because the distribution of radar ground clutter is inescapably dependent upon the background that is being scanned. The diversity in the distributions that may be encountered should present a rather interesting test for the parametric Bayesian technique, whose performance is so dependent on the data distribution. Also, this data set presents results which point out some advantages of the heuristic, distribution free k-Nearest Neighbor technique.

The results showed that the k nearest neighbor algorithm performed better than the Bayesian algorithm on all counts for this particular combination of the data. Recall that P* is used as a standard by which to judge the performance of other algorithms since it represents the ultimate error rate (the Bayes error rate). It seems ironic that, since the upper bound on the k-NN error rate is only 2P*, we could now use the nearest neighbor error rate as a standard by which to measure how accurate the assumptions about the underlying distributions actually were in the Bayes algorithm.

Although execution rate and memory allocation were not the prime considerations in the analysis, we consider our findings in these areas to be worthy of some recognition. Table 6.0.1 and Figure 6.0.1 summarize the memory requirements and the execution time of both algorithms. Also, Figure 6.0.2(a) and (b) present a map of the amplitude categories and the location of the occurrence of errors for the best Nearest Neighbor and Bayesian techniques with 255 samples.


Algorithms    IBANK    DBANK    COMMON BANK    TOTAL

KNN           2339     2207     72             4618
BAYES         2252     2165     114            4531

Table 6.0.1 Memory Allocation to the Two Algorithms.


6.1 Conclusion


The performance of the k-NN classifiers was consistent with the analytical arguments presented in Section 3.2. The first and third nearest neighbor classifiers performed the best because the density of patterns in some categories was quite small, in which case choosing k large is somewhat ambiguous.

The various Bayes classifiers used differ only in that different monotonic nondecreasing functions were applied to the discriminant function. There was no reason to expect any particular one of these classifiers to perform better than the others. However, the heuristic conclusion drawn about these decision functions is that, based on the results from the version of the Bayes algorithm used in this paper and the data under consideration, the decision function raised to the one-half power gives the lowest error rate for this algorithm. In terms of performance it is followed by the logarithmic, the linear and the squared decision functions, respectively.

As far as a comparison between the Bayes and the k-NN classifiers goes, the results indicated that the four nearest neighbor classifiers used, k = 1, 3, 7 and 10, give a smaller probability of error than all the variations of the Bayes classifier considered. This is contrary to all previous analysis.

This may be explained by the fact that the parametric Bayes algorithm is only as good as the underlying assumption about the distribution of the data. The poor performance of the Bayes classifier is an indicator of the fact that the histograms formed in an attempt to approximate the distribution of the data over each class were hardly representative of the true distribution for the various classes.

Estimating the underlying statistics by means of histograms
will not necessarily be of much use unless the histograms contain
information on the population of all possible samples. On the
other hand, if we assume some distribution, there is no guarantee
that it will represent the correct densities, and it may therefore
give arbitrarily poor results. This is one more reason why the
k-NN technique may be more practical than the Bayes classifier.
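The weakness of histogram estimates built from few samples can be illustrated with a small sketch. It assumes a one-dimensional feature, equal-width bins, and made-up amplitude samples; none of these choices is taken from the routine actually used in this study.

```python
def histogram_density(samples, lo, hi, bins):
    """Estimate a 1-D density on [lo, hi] with equal-width bins."""
    width = (hi - lo) / bins
    counts = [0] * bins
    for x in samples:
        idx = min(int((x - lo) / width), bins - 1)
        counts[idx] += 1
    n = len(samples)
    return [c / (n * width) for c in counts]


def classify(x, densities, priors, lo, hi, bins):
    """Pick the class maximizing prior * estimated density at x."""
    width = (hi - lo) / bins
    idx = min(int((x - lo) / width), bins - 1)
    scores = [p * d[idx] for p, d in zip(priors, densities)]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Hypothetical training amplitudes for two classes (four samples each).
class_a = [0.10, 0.20, 0.25, 0.30]
class_b = [0.60, 0.70, 0.75, 0.90]
lo, hi, bins = 0.0, 1.0, 10

dens = [histogram_density(class_a, lo, hi, bins),
        histogram_density(class_b, lo, hi, bins)]

# A test value falling in a bin that no training sample reached:
# both estimated densities are zero, so the decision is arbitrary.
label, scores = classify(0.45, dens, [0.5, 0.5], lo, hi, bins)
print(label, scores)
```

With so few samples most bins are empty, and any pattern landing in an empty bin is classified by default rather than by evidence.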

6.2 Recommendations for Future Work

There is a great deal of uncertainty and variability in what
is known of the probability distribution of radar ground clutter.
It is clear, however, that this distribution will be highly
dependent upon the characteristics of the background.

Because of the very nature of ground clutter, specifically
the variability of its distribution with background
characteristics, and, equally important, the inferior results
obtained from the Bayes algorithm in this analysis, the
nonparametric k-NN technique is recommended over the Bayes
classifier for the characterization of ground clutter. Also,
because of the fairly conservative upper bound on the error of
the k-NN classifier, I suggest that its results be used as a
standard for evaluating assumptions made about the underlying
distribution by the parametric Bayes technique.
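One way such a standard could be applied is through the asymptotic bound of Cover and Hart (cited in Sub-Appendix E-B), R* <= R_NN <= R*(2 - M R*/(M - 1)), where R* is the Bayes error, R_NN the 1-NN error and M the number of classes. The sketch below inverts the bound to obtain the smallest Bayes error consistent with an observed 1-NN error. The function names, the error value and the class count are illustrative assumptions, and the bound holds only in the large-sample limit, so finite-sample figures should be read as rough guides.

```python
import math


def nn_error_upper_bound(bayes_error, n_classes):
    """Cover-Hart asymptotic upper bound on the 1-NN error rate."""
    m = float(n_classes)
    return bayes_error * (2.0 - m / (m - 1.0) * bayes_error)


def bayes_error_lower_bound(nn_error, n_classes):
    """Smallest Bayes error consistent with a given asymptotic 1-NN error.

    Solves R_NN = R*(2 - c R*) for R*, with c = M / (M - 1).
    """
    m = float(n_classes)
    c = m / (m - 1.0)
    disc = max(0.0, 1.0 - c * nn_error)
    return (1.0 - math.sqrt(disc)) / c


# Hypothetical figures: a 1-NN error of 5 percent on a 4-class problem.
nn_err = 0.05
lower = bayes_error_lower_bound(nn_err, 4)
print(f"Bayes error lies between {lower:.4f} and {nn_err:.4f}")
```

A parametric Bayes classifier whose measured error falls far above the observed 1-NN error, as it does in the tabulated runs, is then a sign that the assumed distribution is a poor fit.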
SUB-APPENDIX E-A
Bayes Classifier Algorithm
(See Appendix A, pp. 64-69)
SUB-APPENDIX E-B
k-NN Algorithm

KNN

This  routine  performs  the  K-Nearest  Neighbor  classification 
for  category-type  data,  where  K  =  1,3,4,5,6,7,8,9,10.  "Nearness" 
is  defined  on  the  basis  of  the  interpattern  distances. 


WORD  DEFAULT  DESCRIPTION 

NIN  (IORIG)  Input  unit. 


Prerequisites:  The  distance  matrix  must  be  present  on  NIN.  See 
DIST.  Must  have  category-type  data. 

References: T. M. Cover and P. E. Hart, IEEE Trans. on Info.
Theory, IT-13, 21 (1967).
DEFINITIONS OF TERMS FOR kNN


1. 1-NN

The category of the pattern closest to the pattern being classified
(smallest D_ij, i ≠ j).

2. COMMITTEE VOTES (K-NN, K = 3, 4, 5, 6, 7, 8, 9, 10)

The category which is represented most often among the K patterns
closest to the pattern being classified.

In cases where two or more categories are equally represented,
the category which has the smallest sum of distances.

3. TOTAL MISSED

a. TRAINING SET:

For the given K-NN, the total number of patterns which were
misclassified.

b. TEST/PREDICTION SET:

For the given K-NN, the total number of patterns which were
not classified as the category indicated (i.e., no distinction is
made between TEST and PREDICTION set patterns).

4. PERCENT CORRECT

a. TRAINING SET:

% = (NPAT - #missed)(100.0)/NPAT

b. TEST/PREDICTION SET:

% = (NTEST - #missed)(100.0)/NTEST
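The definitions above can be sketched in a few lines. This is an illustrative reading, not the KNN routine itself: the function names, the distance values and the category labels are hypothetical, and the input is assumed to be one row of a precomputed interpattern distance matrix.

```python
def knn_classify(dist_row, labels, k, skip=None):
    """Classify one pattern from its row of interpattern distances.

    skip is the index of the pattern itself, excluded so that i != j.
    """
    order = sorted((d, j) for j, d in enumerate(dist_row) if j != skip)
    votes, dist_sum = {}, {}
    for d, j in order[:k]:
        c = labels[j]
        votes[c] = votes.get(c, 0) + 1
        dist_sum[c] = dist_sum.get(c, 0.0) + d
    # Committee vote: most votes wins; equally represented categories
    # are separated by the smallest sum of distances.
    return min(votes, key=lambda c: (-votes[c], dist_sum[c]))


def percent_correct(predicted, actual):
    """% = (N - #missed)(100.0)/N, as in definition 4."""
    missed = sum(p != a for p, a in zip(predicted, actual))
    return (len(actual) - missed) * 100.0 / len(actual)


# Hypothetical distances from one test pattern to four training patterns.
row = [1.0, 1.2, 5.0, 1.3]
cats = ["A", "B", "A", "B"]

print(knn_classify(row, cats, 1))  # single nearest neighbor
print(knn_classify(row, cats, 4))  # 2-2 tie, settled by distance sums
print(percent_correct(["A", "B", "B", "A"], ["A", "B", "A", "A"]))
```

In this example the 1-NN decision and the 4-NN committee decision differ, because the tie between the two categories is settled in favor of the one whose members lie closer in total.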


IMPLEMENTATION 


1.  Subroutines: 

KNN 

INPUKN:  input 
MAINKN:  driver 
OUT1KN:  output,  header
SORTKN:  sorts  out  nearest  10  neighbors 
COMMKN:  committee  votes 
OUT2KN:  output,  pattern  results 
OUT3KN:  output,  result  summation 
INACKN:  interactive  terminal  driver 

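2. Organization:

[Organization chart not recoverable from the source copy. Legible
node labels: KNN, OUT1, INPU, MAIN, INAC (upper level); SORT, COMM,
OUT2, OUT3 (lower level).]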
SUB-APPENDIX E-
Tabulation of Error Runs (x, y, Amplitude)
Pe_i [π] For 60 Samples

              Pe for Bayes                      Pe for KNN
Run #   Ln      **.5    **1     **2     1st NN  3rd NN  7th NN  10th NN
  1     --      1.0     1.0     --      .75     .75     --      --
  2     --      --      --      --      --      --      1.0     1.0
  3     .50     .5000   1.000   1.000   .2500   .5000   .7500   .7500
  4     .5000   .5000   .7500   .7500   .2500   .2500   .500    .7500
  5     1.000   1.000   .7500   .7500   .5000   .7500   1.000   1.000
  6     1.000   1.000   1.000   1.000   .2500   .2500   .7500   1.000
  7     .7500   .7500   1.000   1.000   .7500   .7500   .5000   .7500
  8     1.000   1.000   1.000   1.000   .000    1.000   .2500   .000
  9     1.000   .7500   .7500   .7500   .2500   .2500   .7500   .7500
 10     .75     1.000   1.000   1.000   1.000   1.000   .000    .7500
 11     1.000   --      1.000   1.000   1.000   1.000   1.000   1.000
 12*    .7500   1.000   .7500   .7500   .7500   .7500
 13     .7500   .2500   .2500   .2500   .2500   .2500   .2500   .2500
 14     .7500   .7500   .7500   1.000   .2500   .5000   .5000   .7500
 15     .7500   1.000   1.000   1.000   .5000   .7500   1.000   1.000
Sum     12.75   12.5    13.25   13.25   7.75    9.5     10.0    11.5

--  entry illegible in the source copy.
*   fewer than eight entries are legible; they are listed in source
    order and their column alignment is uncertain.
Pe_i [π] For 135 Samples

              Pe for Bayes                      Pe for KNN
Run #   Ln      **.5    **1     **2     1st NN  3rd NN  7th NN  10th NN
  1     .7778   .5556   .4444   .5556   .000    .1111   .1111   .3333
  2     .5556   .5556   .5556   .5556   .1111   .2222   .2222   .2222
  3     .4444   .5556   .4444   .5556   .1111   .2222   .000    .000
  4     .8889   .6667   .6667   .7778   .000    .1111   .1111   .1111
  5     .4444   .4444   .4444   .4444   .2222   .2222   .2222   .2222
  6     .7778   .4444   .4444   .4444   .000    .000    .000    --
  7*    .5556   .1111   .2222   .2222   .000    .1111   .2222
  8     .5556   --      .3333   --      .1111   .1111   .1111   --
  9     .7778   .8889   .8889   .8889   .000    .000    .000    --
 10*    .8889   --      .4444   .000    --      .000    --
 11     .6667   --      .4444   --      .000    .000    .1111   --
 12*    --      .2222   --      .000    .1111   --      --
 13*    .7778   .3333   --      .1111   .1111   --      --
 14*    --      .5556   .000    .000    --      --      --
 15     --      .3333   .3333   .3333   .000    .000    .000    .000
Sum     10.6669 6.7777  6.7775  7.1111  .6666   1.3332  1.4443  1.6665

--  entry illegible in the source copy.
*   fewer than eight entries are legible; they are listed in source
    order and their column alignment is uncertain.




Pe_i [π] For 255 Samples

              Pe for Bayes
Run #   Ln      **.5    **1     **2
  1     .4706   .2941   .3529   .3529
  2     .5294   .2941   .2941   .4706
  3     .5294   .3529   .4118   .4706
  4     .5882   .4118   .4118   .4118
  5     .5882   .3529   .2941   .2941
  6     .8824   .2941   .2941   .2941
  7     .7059   .3529   .2941   .3529
  8     .8235   .5924   .6471   .6471
  9     .7647   .1765   .1765   .2353
 10     .8235   .2941   .1176   .1176
 11     .7059   .2353   .2353   .2941
 12     .7647   .2353   .1765   .1765
 13     --      .3529   .2941   .2941
 14     .5882   .1785   .1765   .1765
 15     .7647   .2353   .2941   .3529
Sum     10.235  4.5881  4.4706  4.9464

--  entry illegible in the source copy.

Pe for KNN: only the entries .000 and .0588 are legible in the
source copy.
Pe_i [π] For 510 Samples

              Pe for Bayes                      Pe for KNN
Run #   Ln      **.5    **1     **2     1st NN  3rd NN  7th NN  10th NN
  1     .2941   .2353   .3235   .3824   .000    .0588   .0588   --
  2     .5294   .3529   .5000   .5000   .0882   .0588   .0882   --
  3     .0882   .0294   .0588   .1471   .0294   .0882   .0882   --
  4     .1176   .4118   .5000   .4706   .000    .0294   .0294   --
  5     .4706   .2647   .5294   .6174   .0588   .0296   .000    --
  6     .2941   .2941   .5382   .6176   .0882   .1176   .1176   --
  7     .0588   .2353   .2941   .3235   .0294   .0296   .0296   --
  8     .2059   .4118   .3824   .4412   .0296   .0588   .0298   --
  9     .2941   .3826   .5588   .7353   .0882   .0588   .0588   --
 10     .7059   .4706   .6176   .6671   .1176   .0588   .0296   --
 11     .7941   .7059   .6671   .6671   .0822   .0296   .000    --
 12     .7059   .4706   .5000   .4118   .000    .000    .0588   --
 13     .6765   .4706   .4412   .4706   .0296   .0296   .0296   .0264
 14     .7353   .2059   .4612   .4612   .000    .000    .000    --
 15     .1176   .1471   .1765   .1765   .000    .000    .000    .000
Sum     6.0881  5.0884  7.5764  7.1296  .6438   .6408   .6174   .7644

--  entry missing or illegible in the source copy.


Pe_i [π] For 1005 Samples

              Pe for Bayes                      Pe for KNN
Run #   Ln      **.5    **1     **2     1st NN  3rd NN  7th NN  10th NN
  1     .4328   .2537   .3731   .3731   .1196   .1363   .1343   .1343
  2     .2090   .1642   .3433   .3433   .0668   .0597   .0746   .0746
  3     .3433   .3134   .4328   .4925   .0597   .0597   .0468   .0597
  4     .1950   .2985   .2985   .2985   .0448   .0448   .0299   .0448
  5     .3731   .3285   .4328   .4776   .1049   .0169   .0229   .0229
  6     .1045   .2090   .2836   .3134   .0896   .0766   .0896   .0766
  7     .4179   .2537   .5274   .5522   .0448   .0149   .000    .0149
  8     .1642   .2388   .2537   .2537   .0299   .0299   .0296   .0468
  9     .4030   .2836   .5821   .6507   .0149   .000    .0169   .000
 10     .1363   .2985   .2985   .2985   .000    .000    .000    .00
 11     .4179   .1306   .4058   .4627   .0299   .0299   .0296   .0299
 12     .1940   .1791   .2233   .2537   .0149   .0149   .0468   .0468
 13     .4030   .1791   .2388   .3286   .0149   .0169   .0149   .0149
 14     .1065   .1960   .2388   .2687   .0149   .000    .0448   --
 15*    .4328   .1791   .3731   .5226   --      .000    .000    .000
Sum     4.3283  3.5035  5.3062  5.9999  .5374   .4925   .5753   .605

--  entry illegible in the source copy.
*   the KNN entries of this row are partly illegible; they are listed
    in source order and their column alignment is uncertain.
MISSION
of
Rome Air Development Center

RADC plans and executes research, development, test and
selected acquisition programs in support of Command, Control,
Communications and Intelligence (C3I) activities. Technical
and engineering support within areas of technical competence
is provided to ESD Program Offices (POs) and other ESD
elements. The principal technical mission areas are
communications, electromagnetic guidance and control,
surveillance of ground and aerospace objects, intelligence data
collection and handling, information system technology,
ionospheric propagation, solid state sciences, microwave
physics and electronic reliability, maintainability and
compatibility.


